Re: Handling stale PRs

2014-08-26 Thread Patrick Wendell
Hey Nicholas, Thanks for bringing this up. There are a few dimensions to this... one is that it's actually procedurally difficult for us to close pull requests. I've proposed several different solutions to ASF infra to streamline the process, but thus far they haven't been open to any of my

Re: [Spark SQL] off-heap columnar store

2014-08-26 Thread Evan Chan
What would be the timeline for the parquet caching work? The reason I'm asking about the columnar compressed format is that there are some problems for which Parquet is not practical. On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust mich...@databricks.com wrote: What is the plan for getting

Re: [Spark SQL] off-heap columnar store

2014-08-26 Thread Michael Armbrust
Any initial proposal or design about the caching to Tachyon that you can share so far? Caching parquet files in tachyon with saveAsParquetFile and then reading them with parquetFile should already work. You can use SQL on these tables by using registerTempTable. Some of the general parquet
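
As a rough sketch of the workflow Michael describes (Spark 1.1-era API, run from spark-shell; the Tachyon URI, case class, and table name below are placeholders):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD

    // Write a SchemaRDD out as Parquet on a Tachyon-backed path.
    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 19)))
    people.saveAsParquetFile("tachyon://tachyon-master:19998/tables/people.parquet")

    // Later (or from another application), read the Parquet files back and
    // register them so they can be queried with SQL.
    val cached = sqlContext.parquetFile("tachyon://tachyon-master:19998/tables/people.parquet")
    cached.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect()

Whether the reads are actually served from Tachyon memory depends on having the Tachyon client on the classpath and the under-filesystem configured, which the thread doesn't cover.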

CoHadoop Papers

2014-08-26 Thread Gary Malouf
One of my colleagues has been questioning me as to why Spark/HDFS makes no attempt to co-locate related data blocks. He pointed to this paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the CoHadoop research and the performance improvements it yielded for Map/Reduce

Re: Handling stale PRs

2014-08-26 Thread Matthew Farrellee
On 08/26/2014 04:57 AM, Sean Owen wrote: On Tue, Aug 26, 2014 at 7:02 AM, Patrick Wendell pwend...@gmail.com wrote: Most other ASF projects I know just ignore these patches. I'd prefer if we Agree, this drives me crazy. It kills part of JIRA's usefulness. Spark is blessed/cursed with

Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
It appears support for this type of control over block placement is going out in the next version of HDFS: https://issues.apache.org/jira/browse/HDFS-2576 On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote: One of my colleagues has been questioning me as to why Spark/HDFS

Re: too many CancelledKeyException throwed from ConnectionManager

2014-08-26 Thread Kousuke Saruta
Hi Shengzhe, I faced the same situation. I think Connection and ConnectionManager have some race condition issues, and the error you mentioned may be caused by them. Now I'm trying to resolve the issue in https://github.com/apache/spark/pull/2019. Please check it out. - Kousuke

Re: Handling stale PRs

2014-08-26 Thread Madhu
Sean Owen wrote Stale JIRAs are a symptom, not a problem per se. I also want to see the backlog cleared, but automatically closing doesn't help, if the problem is too many JIRAs and not enough committer-hours to look at them. Some noise gets closed, but some easy or important fixes may

Re: Handling stale PRs

2014-08-26 Thread Erik Erlandson
- Original Message - Another thing is that we should, IMO, err on the side of explicitly saying no or not yet to patches, rather than letting them linger without attention. We do get patches where the user is well intentioned, but it is Completely agree. The solution is partly

Re: CoHadoop Papers

2014-08-26 Thread Christopher Nguyen
Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS? If the former, Spark does support copartitioning. If the latter, it's an HDFS scope that's outside of Spark. On that note, Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm sure the paper makes useful

HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-26 Thread chutium
is there any dataType auto convert or detect or something in HiveContext? all columns of a table are defined as string in the hive metastore, one column is total_price with values like 123.45, then this column will be recognized as dataType Float in HiveContext... is this a feature or a bug? it really
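
For anyone trying to reproduce the observation, a minimal sketch comparing the schema Spark SQL reports with what the metastore declares (Spark 1.1-era API; on 1.0.x the call would be hql instead of sql, and the table name here is a placeholder):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Table declared in the Hive metastore with all columns as STRING,
    // e.g. an external table backed by a CSV SerDe.
    val sales = hiveContext.sql("SELECT * FROM sales")

    sales.printSchema()
    // the thread reports output like:  |-- total_price: float (nullable = true)
    // even though DESCRIBE sales in Hive shows total_price as string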

Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
Hi Alexander, Can you post a link to the code? RJ On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I've implemented back propagation algorithm using Gradient class and a simple update using Updater class. Then I run the algorithm with mllib's

Re: Handling stale PRs

2014-08-26 Thread Nicholas Chammas
On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote: I'd prefer if we took the approach of politely explaining why in the current form the patch isn't acceptable and closing it (potentially w/ tips on how to improve it or narrow the scope). Amen to this. Aiming for such

Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-26 Thread chutium
oops, I tried on a managed table, and the column types are not changed, so it is most likely due to the serde lib CSVSerDe (https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L123) or maybe CSVReader from opencsv?... but if the columns are defined as

Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
Christopher, can you expand on the co-partitioning support? We have a number of spark SQL tables (saved in parquet format) that all could be considered to have a common hash key. Our analytics team wants to do frequent joins across these different data-sets based on this key. It makes sense

Re: Handling stale PRs

2014-08-26 Thread Josh Rosen
Last weekend, I started hacking on a Google App Engine app for helping with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).  Some of my basic goals (not all implemented yet): - Users sign in using GitHub and can browse a list of pull requests, including links to associated

Re: Handling stale PRs

2014-08-26 Thread Nicholas Chammas
OK, that sounds pretty cool. Josh, Do you see this app as encompassing or supplanting the functionality I described as well? Nick On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote: Last weekend, I started hacking on a Google App Engine app for helping with pull request

Re: Handling stale PRs

2014-08-26 Thread Josh Rosen
Sure; App Engine supports cron and sending emails.  We can configure the app with Spark QA’s credentials in order to allow it to post comments on issues, etc. - Josh On August 26, 2014 at 11:38:08 AM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: OK, that sounds pretty cool. Josh,

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-26 Thread npanj
I have both SPARK-2878 and SPARK-2893. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-2878-Kryo-serialisation-with-custom-Kryo-registrator-failing-tp7719p8046.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Nicholas Chammas
I downloaded the source code release for 1.0.2 from here http://spark.apache.org/downloads.html and launched an EC2 cluster using spark-ec2. After the cluster finishes launching, I fire up the shell and check the version: scala> sc.version res1: String = 1.0.1 The startup banner also shows the

Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Shivaram Venkataraman
This is a chicken and egg problem in some sense. We can't change the ec2 script till we have made the release and uploaded the binaries -- But once that is done, we can't update the script. I think the model we support so far is that you can launch the latest spark version from the master branch

Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
Xiangrui, I posted a note on my JIRA for MiniBatch KMeans about the same problem -- sampling running in O(n). Can you elaborate on ways to get more efficient sampling? I think this will be important for a variety of stochastic algorithms. RJ On Tue, Aug 26, 2014 at 12:54 PM, Xiangrui Meng
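
To make the O(n) concern concrete, here is an illustrative sketch (not the actual MLlib code) of what per-iteration mini-batch sampling looks like at the RDD level:

    import org.apache.spark.rdd.RDD

    // Each iteration draws a fresh sample with RDD.sample, which scans every
    // partition of `data` even when miniBatchFraction is tiny -- so the
    // per-iteration sampling cost stays proportional to the full data size.
    def sgdIterations[T](data: RDD[T], numIterations: Int, miniBatchFraction: Double): Unit = {
      for (i <- 1 to numIterations) {
        val miniBatch = data.sample(false, miniBatchFraction, 42L + i)
        // ... compute and aggregate gradients over miniBatch ...
        miniBatch.count()  // stand-in for the gradient aggregation step
      }
    }

A representation that supports cheaper repeated draws is essentially what RJ proposes in the follow-up message.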

Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
Also, another idea: many algorithms that use sampling tend to do so multiple times. It may be beneficial to allow a transformation to a representation that is more efficient for multiple rounds of sampling. On Tue, Aug 26, 2014 at 4:36 PM, RJ Nowling rnowl...@gmail.com wrote: Xiangrui, I

Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread Ulanov, Alexander
Hi Xiangrui, Thanks for explanation, but I'm still missing something. In my experiments, if miniBatchFraction == 1.0, no matter how the data is partitioned (2, 4, 8, 16 partitions), the algorithm executes more or less in the same time. (I have 16 Workers). Reduce from runMiniBatchSGD takes

Re: Handling stale PRs

2014-08-26 Thread Nicholas Chammas
By the way, as a reference point, I just stumbled across the Discourse GitHub project and their list of pull requests https://github.com/discourse/discourse/pulls looks pretty neat. ~2,200 closed PRs, 6 open. Least recently updated PR dates to 8 days ago. Project started ~1.5 years ago. Dunno

Re: CoHadoop Papers

2014-08-26 Thread Michael Armbrust
It seems like there are two things here: - Co-locating blocks with the same keys to avoid network transfer. - Leveraging partitioning information to avoid a shuffle when data is already partitioned correctly (even if those partitions aren't yet on the same machine). The former seems more
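
To illustrate the second point at the RDD level, a minimal sketch for spark-shell (the paths and key extraction are placeholders; how this maps onto Spark SQL Parquet tables is a separate question):

    import org.apache.spark.SparkContext._
    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(64)

    // Hash-partition both datasets on the shared key and cache them.
    val orders = sc.textFile("hdfs:///data/orders")
      .map(line => (line.split('\t')(0), line))   // (customerId, record)
      .partitionBy(partitioner).cache()

    val clicks = sc.textFile("hdfs:///data/clicks")
      .map(line => (line.split('\t')(0), line))
      .partitionBy(partitioner).cache()

    // Matching keys already live in corresponding partitions, so this join
    // is a narrow dependency and shuffles neither side.
    val joined = orders.join(clicks)

Because both RDDs share the same partitioner the join avoids a shuffle, but the matching partitions may still be read over the network if they sit on different nodes, which is where the block co-location in the first point would help.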

Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
Hi Michael, I think once that work is into HDFS, it will be great to expose this functionality via Spark. This is something worth pursuing because it could grant orders of magnitude perf improvements in cases when people need to join data. The second item would be very interesting, could yield

OutOfMemoryError when running sbt/sbt test

2014-08-26 Thread jay vyas
Hi spark. I've been trying to build spark, but I've been getting lots of oome exceptions. https://gist.github.com/jayunit100/d424b6b825ce8517d68c For the most part, they are of the form: java.lang.OutOfMemoryError: unable to create new native thread I've attempted to hard code the

Re: OutOfMemoryError when running sbt/sbt test

2014-08-26 Thread Mubarak Seyed
What is your ulimit value? On Tue, Aug 26, 2014 at 5:49 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi spark. I've been trying to build spark, but I've been getting lots of oome exceptions. https://gist.github.com/jayunit100/d424b6b825ce8517d68c For the most part, they are of the

Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Matei Zaharia
This shouldn't be a chicken-and-egg problem, since the script fetches the AMI from a known URL. Seems like an issue in publishing this release. On August 26, 2014 at 1:24:45 PM, Shivaram Venkataraman (shiva...@eecs.berkeley.edu) wrote: This is a chicken and egg problem in some sense. We can't

Re: OutOfMemoryError when running sbt/sbt test

2014-08-26 Thread Jay Vyas
Thanks...! Some questions below. 1) you are suggesting that maybe this OOME is a symptom/red herring, and the true cause of it is that a thread can't spawn because of ulimit... If so, possibly this could be flagged early on in the build. And -- where are so many threads coming from that I need

Re: OutOfMemoryError when running sbt/sbt test

2014-08-26 Thread Anand Avati
Hi Jay, The recommended way to build spark from source is through the maven system. You would want to follow the steps in https://spark.apache.org/docs/latest/building-with-maven.html to set the MAVEN_OPTS to prevent OOM build errors. Thanks On Tue, Aug 26, 2014 at 5:49 PM, jay vyas

Re: Handling stale PRs

2014-08-26 Thread Madhu
Nicholas Chammas wrote Dunno how many committers Discourse has, but it looks like they've managed their PRs well. I hope we can do as well in this regard as they have. Discourse developers appear to eat their own dog food https://meta.discourse.org . Improved collaboration and a shared vision

Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Tathagata Das
Yes, this was an oversight on my part. I have opened a JIRA for this. https://issues.apache.org/jira/browse/SPARK-3242 For the time being the workaround should be providing the version 1.0.2 explicitly as part of the script. TD On Tue, Aug 26, 2014 at 6:39 PM, Matei Zaharia