Re: Handling stale PRs

2014-08-26 Thread Patrick Wendell
Hey Nicholas,

Thanks for bringing this up. There are a few dimensions to this... one is
that it's actually procedurally difficult for us to close pull requests.
I've proposed several different solutions to ASF infra to streamline the
process, but thus far they haven't been open to any of my ideas:

https://issues.apache.org/jira/browse/INFRA-7918
https://issues.apache.org/jira/browse/INFRA-8241

The more important thing, maybe, is how we want to deal with this
culturally. And I think we need to do a better job of making sure no pull
requests go unattended (i.e. waiting for committer feedback). If patches go
stale, it should be because the user hasn't responded, not us.

Another thing is that we should, IMO, err on the side of explicitly saying
"no" or "not yet" to patches, rather than letting them linger without
attention. We do get patches where the user is well intentioned, but it is
a feature that doesn't make sense to add, or isn't well thought out or
explained, or the review effort would be so large it's not within our
capacity to look at just yet.

Most other ASF projects I know just ignore these patches. I'd prefer if we
took the approach of politely explaining why in the current form the patch
isn't acceptable and closing it (potentially w/ tips on how to improve it
or narrow the scope).

- Patrick




On Mon, Aug 25, 2014 at 9:57 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hey Nicholas,

 In general we've been looking at these periodically (at least I have) and
 asking people to close out-of-date ones, but it's true that the list has
 gotten fairly large. We should probably have an expiry time of a few months
 and close them automatically. I agree that it's daunting to see so many
 open PRs.

 Matei

 On August 25, 2014 at 9:03:09 PM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:

 Check this out:

 https://github.com/apache/spark/pulls?q=is%3Aopen+is%3Apr+sort%3Aupdated-asc

 We're hitting close to 300 open PRs. Those are the least recently updated
 ones.

 I think having a low number of stale (i.e. not recently updated) PRs is a
 good thing to shoot for. It doesn't leave contributors hanging (which feels
 bad for contributors), and reduces project clutter (which feels bad for
 maintainers/committers).

 What is our approach to tackling this problem?

 I think communicating and enforcing a clear policy on how stale PRs are
 handled might be a good way to reduce the number of stale PRs we have
 without making contributors feel rejected.

 I don't know what such a policy would look like, but it should be
 enforceable and lightweight--i.e. it shouldn't feel like a hammer used to
 reject people's work, but rather a necessary tool to keep the project's
 contributions relevant and manageable.

 Nick



Re: [Spark SQL] off-heap columnar store

2014-08-26 Thread Evan Chan
What would be the timeline for the parquet caching work?

The reason I'm asking about the columnar compressed format is that
there are some problems for which Parquet is not practical.

On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust
mich...@databricks.com wrote:
 What is the plan for getting Tachyon/off-heap support for the columnar
 compressed store?  It's not in 1.1 is it?


 It is not in 1.1 and there are no concrete plans for adding it at this
 point.  Currently, there is more engineering investment going into caching
 parquet data in Tachyon instead.  This approach is going to have much better
 support for nested data, leverages other work being done on parquet, and
 alleviates your concerns about wire format compatibility.

 That said, if someone really wants to try and implement it, I don't think it
 would be very hard.  The primary issue is going to be designing a clean
 interface that is not too tied to this one implementation.


 Also, how likely is the wire format for the columnar compressed data
 to change?  That would be a problem for write-through or persistence.


 We aren't making any guarantees at the moment that it won't change.  It's
 currently only intended for temporary caching of data.




Re: [Spark SQL] off-heap columnar store

2014-08-26 Thread Michael Armbrust

 Any initial proposal or design about the caching to Tachyon that you
 can share so far?


Caching parquet files in tachyon with saveAsParquetFile and then reading
them with parquetFile should already work. You can use SQL on these tables
by using registerTempTable.
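
For example, a minimal spark-shell sketch of that flow (untested; the
tachyon:// URI and table names are illustrative assumptions):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// save an existing table as parquet, backed by Tachyon
val orders = hiveContext.sql("select * from some_table")
orders.saveAsParquetFile("tachyon://tachyon-master:19998/tables/orders.parquet")
// read the parquet data back and register it so SQL queries hit the Tachyon copy
val cached = hiveContext.parquetFile("tachyon://tachyon-master:19998/tables/orders.parquet")
cached.registerTempTable("orders_cached")
hiveContext.sql("select count(*) from orders_cached").collect().foreach(println)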

Some of the general parquet work that we have been doing includes: #1935
https://github.com/apache/spark/pull/1935, SPARK-2721
https://issues.apache.org/jira/browse/SPARK-2721, SPARK-3036
https://issues.apache.org/jira/browse/SPARK-3036, SPARK-3037
https://issues.apache.org/jira/browse/SPARK-3037 and #1819
https://github.com/apache/spark/pull/1819

The reason I'm asking about the columnar compressed format is that
 there are some problems for which Parquet is not practical.


Can you elaborate?


CoHadoop Papers

2014-08-26 Thread Gary Malouf
One of my colleagues has been questioning me as to why Spark/HDFS makes no
attempts to try to co-locate related data blocks.  He pointed to this
paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
CoHadoop research and the performance improvements it yielded for
Map/Reduce jobs.

Would leveraging these ideas for writing data from Spark make sense/be
worthwhile?


Re: Handling stale PRs

2014-08-26 Thread Matthew Farrellee

On 08/26/2014 04:57 AM, Sean Owen wrote:

On Tue, Aug 26, 2014 at 7:02 AM, Patrick Wendell pwend...@gmail.com wrote:

Most other ASF projects I know just ignore these patches. I'd prefer if we


Agree, this drives me crazy. It kills part of JIRA's usefulness. Spark
is blessed/cursed with incredible inbound load, but would love to
still see the project get this right-er than, say, Hadoop.


totally agree, this applies to patches as well as jiras. i'll add that 
projects that let things simply linger are missing an opportunity to 
engage their community.


spark should capitalize on its momentum to build a smoothly running 
community (vs not and accept an unbounded backlog as inevitable).




The more important thing, maybe, is how we want to deal with this
culturally. And I think we need to do a better job of making sure no pull
requests go unattended (i.e. waiting for committer feedback). If patches go
stale, it should be because the user hasn't responded, not us.


Stale JIRAs are a symptom, not a problem per se. I also want to see
the backlog cleared, but automatically closing doesn't help, if the
problem is too many JIRAs and not enough committer-hours to look at
them. Some noise gets closed, but some easy or important fixes may
disappear as well.


engagement in the community really needs to go both ways. it's 
reasonable for PRs that stop merging or have open comments that need 
resolution by the PRer to be loudly timed out. a similar thing goes for 
jiras, if there's a request for more information to resolve a bug and 
that information does not appear, half of the communication is gone and 
a loud time out is reasonable.


"easy" and "important" are in the eyes of the beholder. timeouts can go both 
ways. a jira or pr that has been around for a period of time (say 1/3 
the to-close timeout) should bump up for evaluation, hopefully resulting 
in few easy or important issues falling through the cracks.


fyi, i'm periodically going through the pyspark jiras, trying to 
reproduce issues, coalesce duplicates and ask for details. i've not been 
given any sort of permission to do this, i don't have any special 
position in the community to do this - in a well functioning community 
everyone should feel free to jump in and help.




Another thing is that we should, IMO, err on the side of explicitly saying
no or not yet to patches, rather than letting them linger without
attention. We do get patches where the user is well intentioned, but it is


Completely agree. The solution is partly more supply of committer time
on JIRAs. But that is detracting from the work the committers
themselves want to do. More of the solution is reducing demand by
helping people create useful, actionable, non-duplicate JIRAs from the
start. Or encouraging people to resolve existing JIRAs and shepherding
those in.


saying no/not-yet is a vitally important piece of information.



Elsewhere, I've found people reluctant to close JIRA for fear of
offending or turning off contributors. I think the opposite is true.
There is nothing wrong with "no" or "not now", especially accompanied
with constructive feedback. Better to state for the record what is not
being looked at and why, than let people work on and open the same
JIRAs repeatedly.


well stated!



I have also found in the past that a culture of tolerating eternal
JIRAs led people to file JIRAs in order to forget about a problem --
it's in JIRA, and it's in progress, so it feels like someone else is
going to fix it later and so it can be forgotten now.


there's some value in these now-i-can-forget jira, though i'm not 
personally a fan. it can be good to keep them around and reachable by 
search, but they should be clearly marked as no/not-yet or something 
similar.




For what it's worth, I think these project and culture mechanics are
so important and it's my #1 concern for Spark at this stage. This
challenge exists so much more here, exactly because there is so much
potential. I'd love to help by trying to identify and close stale
JIRAs but am afraid that tagging them is just adding to the heap of
work.


+1 concern and potential!


best,


matt




Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
It appears support for this type of control over block placement is going
out in the next version of HDFS:
https://issues.apache.org/jira/browse/HDFS-2576


On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com wrote:

 One of my colleagues has been questioning me as to why Spark/HDFS makes no
 attempts to try to co-locate related data blocks.  He pointed to this
 paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
 CoHadoop research and the performance improvements it yielded for
 Map/Reduce jobs.

 Would leveraging these ideas for writing data from Spark make sense/be
 worthwhile?





Re: too many CancelledKeyException thrown from ConnectionManager

2014-08-26 Thread Kousuke Saruta

Hi Shengzhe

I faced the same situation.

I think Connection and ConnectionManager have some race condition issues,
and the error you mentioned may be caused by these issues.
Now I'm trying to resolve the issue in 
https://github.com/apache/spark/pull/2019.

Please check it out.

- Kousuke

(2014/08/26 8:53), yao wrote:

Hi Folks,

We are testing our home-made KMeans algorithm using Spark on Yarn.
Recently, we've found that the application failed frequently when doing
clustering over 300,000,000 users (each user is represented by a feature
vector and the whole data set is around 600,000,000). After digging into
the job log, we've found that there are many CancelledKeyExceptions thrown
by ConnectionManager, but we have not observed other exceptions. We suspect the
frequent CancelledKeyExceptions bring the whole application down, since the
application often failed on the third or fourth iteration for large
datasets. Any directional suggestions are welcome.

*Errors in job log*:
java.nio.channels.CancelledKeyException
 at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
 at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(lsv-289.rfiserve.net,43199)
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding
SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ?
sun.nio.ch.SelectionKeyImpl@2570cd62
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@2570cd62
java.nio.channels.CancelledKeyException
 at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
 at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ?
sun.nio.ch.SelectionKeyImpl@37c8b85a
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@37c8b85a
java.nio.channels.CancelledKeyException
 at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:287)
 at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ?
sun.nio.ch.SelectionKeyImpl@fcea3a4
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding
SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@fcea3a4


Best
Shengzhe







Re: Handling stale PRs

2014-08-26 Thread Madhu
Sean Owen wrote
 Stale JIRAs are a symptom, not a problem per se. I also want to see
 the backlog cleared, but automatically closing doesn't help, if the
 problem is too many JIRAs and not enough committer-hours to look at
 them. Some noise gets closed, but some easy or important fixes may
 disappear as well.

Agreed. All of the problems mentioned in this thread are symptoms. There's
no shortage of talent and enthusiasm within the Spark community. The people
and the product are wonderful. The process: not so much. Spark has been
wildly successful, some growing pains are to be expected.

Given 100+ contributors, Spark is a big project. As with big data, big
projects can run into scaling issues. There's no magic to running a
successful big project, but it does require greater planning and discipline.
JIRA is great for issue tracking, but it's not a replacement for a project
plan. Quarterly releases are a great idea: everyone knows the schedule. What
we need is a concise plan for each release with a clear scope statement.
Without knowing what is in scope and out of scope for a release, we end up
with a laundry list of things to do, but no clear goal. Laundry lists don't
scale well.

I don't mind helping with planning and documenting releases. This is
especially helpful for new contributors who don't know where to start. I
have done that successfully on many projects using Jira and Confluence, so I
know it can be done. To address immediate concerns of open PRs and
excessive, overlapping Jira issues, we probably have to create a meta issue
and assign resources to fix it. I don't mind helping with that also.



-
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Handling-stale-PRs-tp8015p8031.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Handling stale PRs

2014-08-26 Thread Erik Erlandson


- Original Message -

  Another thing is that we should, IMO, err on the side of explicitly saying
  no or not yet to patches, rather than letting them linger without
  attention. We do get patches where the user is well intentioned, but it is
 
  Completely agree. The solution is partly more supply of committer time
  on JIRAs. But that is detracting from the work the committers
  themselves want to do. More of the solution is reducing demand by
  helping people create useful, actionable, non-duplicate JIRAs from the
  start. Or encouraging people to resolve existing JIRAs and shepherding
  those in.
 
 saying no/not-yet is a vitally important piece of information.

+1, when I propose a contribution to a project, I consider an articulate (and 
hopefully polite) "no thanks", "not-yet", or "needs-work" to be far more 
useful and pleasing than just radio silence.  It allows me to either address 
feedback, or just move on.

Although it takes effort to keep abreast of community contributions, I don't 
think it needs to be an overbearing or heavy-weight process.  I've seen other 
communities where they talked themselves out of better management because they 
conceived the ticket workflow as being more effort than it needed to be.  Much 
better to keep ticket triage and workflow fast/simple, but actually do it.



 
 
  Elsewhere, I've found people reluctant to close JIRA for fear of
  offending or turning off contributors. I think the opposite is true.
  There is nothing wrong with "no" or "not now", especially accompanied
  with constructive feedback. Better to state for the record what is not
  being looked at and why, than let people work on and open the same
  JIRAs repeatedly.
 
 well stated!
 
 
  I have also found in the past that a culture of tolerating eternal
  JIRAs led people to file JIRAs in order to forget about a problem --
  it's in JIRA, and it's in progress, so it feels like someone else is
  going to fix it later and so it can be forgotten now.
 
 there's some value in these now-i-can-forget jira, though i'm not
 personally a fan. it can be good to keep them around and reachable by
 search, but they should be clearly marked as no/not-yet or something
 similar.
 
 
  For what it's worth, I think these project and culture mechanics are
  so important and it's my #1 concern for Spark at this stage. This
  challenge exists so much more here, exactly because there is so much
  potential. I'd love to help by trying to identify and close stale
  JIRAs but am afraid that tagging them is just adding to the heap of
  work.
 
 +1 concern and potential!
 
 
 best,
 
 
 matt
 
 
 




Re: CoHadoop Papers

2014-08-26 Thread Christopher Nguyen
Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?

If the former, Spark does support copartitioning.

If the latter, it's an HDFS scope that's outside of Spark. On that note,
Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm
sure the paper makes useful contributions for its set of use cases.

Sent while mobile. Pls excuse typos etc.
On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote:

 It appears support for this type of control over block placement is going
 out in the next version of HDFS:
 https://issues.apache.org/jira/browse/HDFS-2576


 On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com
 wrote:

  One of my colleagues has been questioning me as to why Spark/HDFS makes
 no
  attempts to try to co-locate related data blocks.  He pointed to this
  paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
  CoHadoop research and the performance improvements it yielded for
  Map/Reduce jobs.
 
  Would leveraging these ideas for writing data from Spark make sense/be
  worthwhile?
 
 
 




HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-26 Thread chutium
is there any dataType auto-convert or auto-detect or something in HiveContext?

all columns of a table are defined as string in the hive metastore

one column is total_price with values like 123.45; this column will then be
recognized as dataType Float in HiveContext...

is this a feature or a bug? it really surprised me... how is it implemented?
if it is a feature, can i turn it off? i want to get a schemaRDD with
exactly the same datatypes defined in the hive metadata. i know the column
total_price should contain float values, but it might not, and what happens
if there is some broken line in my huge CSV file? or maybe some total_price
is 9,123.45 or $123.45 or something

==

some example for this in our env.

MapR v3 cluster, newest spark github master clone from yesterday

built with
sbt/sbt -Dhadoop.version=1.0.3-mapr-3.0.3 -Phive assembly

hive-site.xml configured

==

spark-shell scripts:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use our_live_db")
hiveContext.sql("desc formatted et_fullorders").collect.foreach(println)
...
...
14/08/26 15:47:09 INFO SparkContext: Job finished: collect at
SparkPlan.scala:85, took 0.0305408 s
[# col_name data_type   comment ]
[]
[sidstring  from deserializer   ]
[request_id string  from deserializer   ]
[*times_dq   string*  from deserializer   ]
[*total_pricestring*  from deserializer   ]
[order_id   string  from deserializer   ]
[]
[# Partition Information ]
[# col_name data_type   comment ]
[]
[wt_datestring  None]
[countrystring  None]
[]
[# Detailed Table Information]
[Database:  our_live_db]
[Owner: client02  ]
[CreateTime:Fri Jan 31 12:23:40 CET 2014 ]
[LastAccessTime:UNKNOWN  ]
[Protect Mode:  None ]
[Retention: 0]
[Location: 
maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders ]
[Table Type:EXTERNAL_TABLE   ]
[Table Parameters:   ]
[   EXTERNALTRUE]
[   transient_lastDdlTime   1391167420  ]
[]
[# Storage Information   ]
[SerDe Library: com.bizo.hive.serde.csv.CSVSerde ]
[InputFormat:   org.apache.hadoop.mapred.TextInputFormat ]
[OutputFormat: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   ]
[Compressed:No   ]
[Num Buckets:   -1   ]
[Bucket Columns:[]   ]
[Sort Columns:  []   ]
[Storage Desc Params:]
[   separatorChar   ;   ]
[   serialization.format1   ]

then, create a schemaRDD from this table

val result = hiveContext.sql("select sid, order_id, total_price, times_dq
from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")

ok now, printSchema...

scala> result.printSchema
root
 |-- sid: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- *total_price: float* (nullable = true)
 |-- *times_dq: timestamp* (nullable = true)


total_price was STRING but in the schemaRDD it is now FLOAT,
and
times_dq is now TIMESTAMP

really strange and surprising...

and even stranger:

scala> result.map(row => row.getString(2)).collect.foreach(println)

i got
240.00
45.83
21.67
95.83
120.83

but

scala> result.map(row => row.getFloat(2)).collect.foreach(println)

14/08/26 16:01:24 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 8)
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Float
at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:114)

==

btw, files in this external table are gzipped csv files:
14/08/26 15:49:56 INFO HadoopRDD: Input split:
maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders/wt_date=2014-04-14/country=uk/getFullOrders_2014-04-14.csv.gz:0+16990

and the data in it:

scala> result.collect.foreach(println)
[51402123123,12344000123454,240.00,2014-04-14 00:03:49.082000]
[51402110123,12344000123455,45.83,2014-04-14 00:04:13.639000]
[51402129123,12344000123458,21.67,2014-04-14 00:09:12.276000]
[51402092123,12344000132457,95.83,2014-04-14 00:09:42.228000]
[51402135123,12344000123460,120.83,2014-04-14 00:12:44.742000]

we use CSVSerDe
https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar
maybe


Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
Hi Alexander,

Can you post a link to the code?

RJ


On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

 Hi,

 I've implemented the back propagation algorithm using the Gradient class and a
 simple update using the Updater class. Then I run the algorithm with mllib's
 GradientDescent class. I have trouble scaling out this implementation.
 I thought that if I partition my data into the number of workers then
 performance will increase, because each worker will run a step of gradient
 descent on its partition of data. But this does not happen and each worker
 seems to process all data (if miniBatchFraction == 1.0 as in mllib's
 logistic regression implementation). For me, this doesn't make sense,
 because then only single Worker will provide the same performance. Could
 someone elaborate on this and correct me if I am wrong. How can I scale out
 the algorithm with many Workers?

 Best regards, Alexander




-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Handling stale PRs

2014-08-26 Thread Nicholas Chammas
On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote:

 I'd prefer if we took the approach of politely explaining why in the
 current form the patch isn't acceptable and closing it (potentially w/ tips
 on how to improve it or narrow the scope).


Amen to this. Aiming for such a culture would set Spark apart from other
projects in a great way.

I've proposed several different solutions to ASF infra to streamline the
 process, but thus far they haven't been open to any of my ideas:


I've added myself as a watcher on those 2 INFRA issues. Sucks that the only
solution on offer right now requires basically polluting the commit history.

Short of moving Spark's repo to a non-ASF-managed GitHub account, do you
think another bot could help us manage the number of stale PRs?

I'm thinking a solution as follows might be very helpful:

   - Extend Spark QA / Jenkins to run on a weekly schedule and check for
   stale PRs. Let's say a stale PR is an open one that hasn't been updated in
   N months.
   - Spark QA maintains a list of known committers on its side.
   - During its weekly check of stale PRs, Spark QA takes the following
   action:
  - If the last person to comment on a PR was a committer, post to the
  PR asking for an update from the contributor.
  - If the last person to comment on a PR was a contributor, add the PR
  to a list. Email this list of *hanging PRs* out to the dev list on a
  weekly basis and ask committers to update them.
  - If the last person to comment on a PR was Spark QA asking the
  contributor to update it, then add the PR to a list. Email this
list of *abandoned
  PRs* to the dev list for the record (or for closing, if that becomes
  possible in the future).

This doesn't solve the problem of not being able to close PRs, but it does
help make sure no PR is left hanging for long.
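
As a rough sketch of that classification logic (plain Scala; the committer
list, bot login, and PR/comment fields below are placeholders; the real data
would come from the GitHub API):

case class Comment(author: String, createdAt: Long)
case class PullRequest(number: Int, updatedAt: Long, comments: Seq[Comment])

val committers = Set("pwendell", "mateiz")         // assumed list of logins
val botLogin = "SparkQA"
val staleCutoffMillis = 90L * 24 * 60 * 60 * 1000  // "N months" ~ 3

def classify(pr: PullRequest, now: Long): String =
  if (now - pr.updatedAt < staleCutoffMillis) "active"
  else pr.comments.lastOption.map(_.author) match {
    case Some(`botLogin`) => "abandoned"              // bot already asked for an update
    case Some(a) if committers.contains(a) => "ping-contributor"
    case _ => "hanging"                               // waiting on a committer
  }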

What do you think? I'd be interested in implementing this solution if we
like it.

Nick


Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-26 Thread chutium
oops, i tried on a managed table, and there the column types are not changed

so it is mostly due to the serde lib CSVSerDe
(https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L123)
or maybe CSVReader from opencsv?...

but if the columns are defined as string, no matter what type is returned from
the custom SerDe or CSVReader, they should be cast to string at the end, right?

why not use the schema from the hive metadata directly?
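
until that is clarified, a possible workaround (untested sketch, using plain
HiveQL casts) is to force the types back in the query:

val result = hiveContext.sql(
  "select sid, order_id, cast(total_price as string) as total_price, " +
  "cast(times_dq as string) as times_dq " +
  "from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")
result.printSchema  // both columns should now come back as string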



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/HiveContext-schemaRDD-printSchema-get-different-dataTypes-feature-or-a-bug-really-strange-and-surpri-tp8035p8039.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
Christopher, can you expand on the co-partitioning support?

We have a number of spark SQL tables (saved in parquet format) that all
could be considered to have a common hash key.  Our analytics team wants to
do frequent joins across these different data-sets based on this key.  It
makes sense that if the data for each key across 'tables' was co-located on
the same server, shuffles could be minimized and ultimately performance
could be much better.

From reading the HDFS issue I posted before, the way is being paved for
implementing this type of behavior though there are a lot of complications
to make it work I believe.
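
To illustrate the copartitioning Christopher mentions within Spark itself, a
minimal sketch (the paths and parse helpers are assumptions; both datasets are
keyed by the shared hash key):

import org.apache.spark.HashPartitioner

// assumed CSV layouts; each helper returns a (hashKey, payload) pair
def parseUser(line: String): (String, String) = { val f = line.split(","); (f(0), f(1)) }
def parseOrder(line: String): (String, String) = { val f = line.split(","); (f(0), f(1)) }

val partitioner = new HashPartitioner(64)
val users = sc.textFile("hdfs:///data/users").map(parseUser).partitionBy(partitioner).cache()
val orders = sc.textFile("hdfs:///data/orders").map(parseOrder).partitionBy(partitioner).cache()
// both sides share the same partitioner, so the join avoids a full shuffle
val joined = users.join(orders)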


On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen c...@adatao.com wrote:

 Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?

 If the former, Spark does support copartitioning.

 If the latter, it's an HDFS scope that's outside of Spark. On that note,
 Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm
 sure the paper makes useful contributions for its set of use cases.

 Sent while mobile. Pls excuse typos etc.
 On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote:

 It appears support for this type of control over block placement is going
 out in the next version of HDFS:
 https://issues.apache.org/jira/browse/HDFS-2576


 On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com
 wrote:

  One of my colleagues has been questioning me as to why Spark/HDFS makes
 no
  attempts to try to co-locate related data blocks.  He pointed to this
  paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on
 the
  CoHadoop research and the performance improvements it yielded for
  Map/Reduce jobs.
 
  Would leveraging these ideas for writing data from Spark make sense/be
  worthwhile?
 
 
 




Re: Handling stale PRs

2014-08-26 Thread Josh Rosen
Last weekend, I started hacking on a Google App Engine app for helping with 
pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).  Some of my 
basic goals (not all implemented yet):

- Users sign in using GitHub and can browse a list of pull requests, including 
links to associated JIRAs, Jenkins statuses, a quick preview of the last 
comment, etc.

- Pull requests are auto-classified based on which components they modify (by 
looking at the diff; see the sketch after this list).

- From the app’s own internal database of PRs, we can build dashboards to find 
“abandoned” PRs, graph average time to first review, etc.

- Since we authenticate users with GitHub, we can enable administrative 
functions via this dashboard (e.g. “assign this PR to me”, “vote to close in 
the weekly auto-close commit”, etc.)
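
(A rough sketch of that auto-classification, with placeholder path prefixes; 
the real mapping would live in the app:)

// map a PR's touched file paths to Spark components; prefixes are assumptions
def components(touchedFiles: Seq[String]): Set[String] =
  touchedFiles.flatMap {
    case f if f.startsWith("sql/") => Some("SQL")
    case f if f.startsWith("mllib/") => Some("MLlib")
    case f if f.startsWith("streaming/") => Some("Streaming")
    case f if f.startsWith("python/") => Some("PySpark")
    case _ => None
  }.toSet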

Right now, I’ve implemented GitHub OAuth support and code to update the issues 
database using the GitHub API.  Because we have access to the full API, it’s 
pretty easy to do fancy things like parsing the reason for Jenkins failure, 
etc.  You could even imagine some fancy mashup tools to pull up JIRAs and pull 
requests side-by-side in iframes.

After I hack on this a bit more, I plan to release a public preview version; if 
we find this tool useful, I’ll clean it up and open-source the app so folks can 
contribute to it.

- Josh

On August 26, 2014 at 8:16:46 AM, Nicholas Chammas (nicholas.cham...@gmail.com) 
wrote:

On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote:  

 I'd prefer if we took the approach of politely explaining why in the  
 current form the patch isn't acceptable and closing it (potentially w/ tips  
 on how to improve it or narrow the scope).  


Amen to this. Aiming for such a culture would set Spark apart from other  
projects in a great way.  

I've proposed several different solutions to ASF infra to streamline the  
 process, but thus far they haven't been open to any of my ideas:  


I've added myself as a watcher on those 2 INFRA issues. Sucks that the only  
solution on offer right now requires basically polluting the commit history.  

Short of moving Spark's repo to a non-ASF-managed GitHub account, do you  
think another bot could help us manage the number of stale PRs?  

I'm thinking a solution as follows might be very helpful:  

- Extend Spark QA / Jenkins to run on a weekly schedule and check for  
stale PRs. Let's say a stale PR is an open one that hasn't been updated in  
N months.  
- Spark QA maintains a list of known committers on its side.  
- During its weekly check of stale PRs, Spark QA takes the following  
action:  
- If the last person to comment on a PR was a committer, post to the  
PR asking for an update from the contributor.  
- If the last person to comment on a PR was a contributor, add the PR  
to a list. Email this list of *hanging PRs* out to the dev list on a  
weekly basis and ask committers to update them.  
- If the last person to comment on a PR was Spark QA asking the  
contributor to update it, then add the PR to a list. Email this  
list of *abandoned  
PRs* to the dev list for the record (or for closing, if that becomes  
possible in the future).  

This doesn't solve the problem of not being able to close PRs, but it does  
help make sure no PR is left hanging for long.  

What do you think? I'd be interested in implementing this solution if we  
like it.  

Nick  


Re: Handling stale PRs

2014-08-26 Thread Nicholas Chammas
OK, that sounds pretty cool.

Josh,

Do you see this app as encompassing or supplanting the functionality I
described as well?

Nick


On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Last weekend, I started hacking on a Google App Engine app for helping
 with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).
  Some of my basic goals (not all implemented yet):

 - Users sign in using GitHub and can browse a list of pull requests,
 including links to associated JIRAs, Jenkins statuses, a quick preview of
 the last comment, etc.

 - Pull requests are auto-classified based on which components they modify
 (by looking at the diff).

 - From the app’s own internal database of PRs, we can build dashboards to
 find “abandoned” PRs, graph average time to first review, etc.

 - Since we authenticate users with GitHub, we can enable administrative
 functions via this dashboard (e.g. “assign this PR to me”, “vote to close
 in the weekly auto-close commit”, etc.)

 Right now, I’ve implemented GitHub OAuth support and code to update the
 issues database using the GitHub API.  Because we have access to the full
 API, it’s pretty easy to do fancy things like parsing the reason for
 Jenkins failure, etc.  You could even imagine some fancy mashup tools to
 pull up JIRAs and pull requests side-by-side in iframes.

 After I hack on this a bit more, I plan to release a public preview
 version; if we find this tool useful, I’ll clean it up and open-source the
 app so folks can contribute to it.

 - Josh

 On August 26, 2014 at 8:16:46 AM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:

 On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  I'd prefer if we took the approach of politely explaining why in the
  current form the patch isn't acceptable and closing it (potentially w/
 tips
  on how to improve it or narrow the scope).


 Amen to this. Aiming for such a culture would set Spark apart from other
 projects in a great way.

 I've proposed several different solutions to ASF infra to streamline the
  process, but thus far they haven't been open to any of my ideas:


 I've added myself as a watcher on those 2 INFRA issues. Sucks that the
 only
 solution on offer right now requires basically polluting the commit
 history.

 Short of moving Spark's repo to a non-ASF-managed GitHub account, do you
 think another bot could help us manage the number of stale PRs?

 I'm thinking a solution as follows might be very helpful:

 - Extend Spark QA / Jenkins to run on a weekly schedule and check for
 stale PRs. Let's say a stale PR is an open one that hasn't been updated in
 N months.
 - Spark QA maintains a list of known committers on its side.
 - During its weekly check of stale PRs, Spark QA takes the following
 action:
 - If the last person to comment on a PR was a committer, post to the
 PR asking for an update from the contributor.
 - If the last person to comment on a PR was a contributor, add the PR
 to a list. Email this list of *hanging PRs* out to the dev list on a
 weekly basis and ask committers to update them.
 - If the last person to comment on a PR was Spark QA asking the
 contributor to update it, then add the PR to a list. Email this
 list of *abandoned
 PRs* to the dev list for the record (or for closing, if that becomes
 possible in the future).

 This doesn't solve the problem of not being able to close PRs, but it does
 help make sure no PR is left hanging for long.

 What do you think? I'd be interested in implementing this solution if we
 like it.

 Nick




Re: Handling stale PRs

2014-08-26 Thread Josh Rosen
Sure; App Engine supports cron and sending emails.  We can configure the app 
with Spark QA’s credentials in order to allow it to post comments on issues, 
etc.

- Josh

On August 26, 2014 at 11:38:08 AM, Nicholas Chammas 
(nicholas.cham...@gmail.com) wrote:

OK, that sounds pretty cool.

Josh,

Do you see this app as encompassing or supplanting the functionality I 
described as well?

Nick


On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:
Last weekend, I started hacking on a Google App Engine app for helping with 
pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).  Some of my 
basic goals (not all implemented yet):

- Users sign in using GitHub and can browse a list of pull requests, including 
links to associated JIRAs, Jenkins statuses, a quick preview of the last 
comment, etc.

- Pull requests are auto-classified based on which components they modify (by 
looking at the diff).

- From the app’s own internal database of PRs, we can build dashboards to find 
“abandoned” PRs, graph average time to first review, etc.

- Since we authenticate users with GitHub, we can enable administrative 
functions via this dashboard (e.g. “assign this PR to me”, “vote to close in 
the weekly auto-close commit”, etc.)

Right now, I’ve implemented GitHub OAuth support and code to update the issues 
database using the GitHub API.  Because we have access to the full API, it’s 
pretty easy to do fancy things like parsing the reason for Jenkins failure, 
etc.  You could even imagine some fancy mashup tools to pull up JIRAs and pull 
requests side-by-side in iframes.

After I hack on this a bit more, I plan to release a public preview version; if 
we find this tool useful, I’ll clean it up and open-source the app so folks can 
contribute to it.

- Josh

On August 26, 2014 at 8:16:46 AM, Nicholas Chammas (nicholas.cham...@gmail.com) 
wrote:

On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote:

 I'd prefer if we took the approach of politely explaining why in the
 current form the patch isn't acceptable and closing it (potentially w/ tips
 on how to improve it or narrow the scope).


Amen to this. Aiming for such a culture would set Spark apart from other
projects in a great way.

I've proposed several different solutions to ASF infra to streamline the
 process, but thus far they haven't been open to any of my ideas:


I've added myself as a watcher on those 2 INFRA issues. Sucks that the only
solution on offer right now requires basically polluting the commit history.

Short of moving Spark's repo to a non-ASF-managed GitHub account, do you
think another bot could help us manage the number of stale PRs?

I'm thinking a solution as follows might be very helpful:

- Extend Spark QA / Jenkins to run on a weekly schedule and check for
stale PRs. Let's say a stale PR is an open one that hasn't been updated in
N months.
- Spark QA maintains a list of known committers on its side.
- During its weekly check of stale PRs, Spark QA takes the following
action:
- If the last person to comment on a PR was a committer, post to the
PR asking for an update from the contributor.
- If the last person to comment on a PR was a contributor, add the PR
to a list. Email this list of *hanging PRs* out to the dev list on a
weekly basis and ask committers to update them.
- If the last person to comment on a PR was Spark QA asking the
contributor to update it, then add the PR to a list. Email this
list of *abandoned
PRs* to the dev list for the record (or for closing, if that becomes
possible in the future).

This doesn't solve the problem of not being able to close PRs, but it does
help make sure no PR is left hanging for long.

What do you think? I'd be interested in implementing this solution if we
like it.

Nick



Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-26 Thread npanj
I have both SPARK-2878 and SPARK-2893. 



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-2878-Kryo-serialisation-with-custom-Kryo-registrator-failing-tp7719p8046.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Nicholas Chammas
I downloaded the source code release for 1.0.2 from here
http://spark.apache.org/downloads.html and launched an EC2 cluster using
spark-ec2.

After the cluster finishes launching, I fire up the shell and check the
version:

scala sc.version
res1: String = 1.0.1

The startup banner also shows the same thing. Hmm...

So I dig around and find that the spark_ec2.py script has the default Spark
version set to 1.0.1.

Derp.

  parser.add_option("-v", "--spark-version", default="1.0.1",
  help="Version of Spark to use: 'X.Y.Z' or a specific git hash")

Is there any way to fix the release? It’s a minor issue, but could be very
confusing. And how can we prevent this from happening again?

Nick
​


Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Shivaram Venkataraman
This is a chicken and egg problem in some sense. We can't change the ec2
script till we have made the release and uploaded the binaries -- but once
that is done, we can't update the script.

I think the model we support so far is that you can launch the latest
spark version from the master branch on github. I guess we can try to add
something in the release process that updates the script but doesn't commit
it? The release managers might be able to add more.

Thanks
Shivaram


On Tue, Aug 26, 2014 at 1:16 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I downloaded the source code release for 1.0.2 from here
 http://spark.apache.org/downloads.html and launched an EC2 cluster using
 spark-ec2.

 After the cluster finishes launching, I fire up the shell and check the
 version:

 scala sc.version
 res1: String = 1.0.1

 The startup banner also shows the same thing. Hmm...

 So I dig around and find that the spark_ec2.py script has the default Spark
 version set to 1.0.1.

 Derp.

   parser.add_option("-v", "--spark-version", default="1.0.1",
   help="Version of Spark to use: 'X.Y.Z' or a specific git hash")

 Is there any way to fix the release? It’s a minor issue, but could be very
 confusing. And how can we prevent this from happening again?

 Nick
 ​



Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
Xiangrui,

I posted a note on my JIRA for MiniBatch KMeans about the same problem --
sampling running in O(n).

Can you elaborate on ways to get more efficient sampling?  I think this
will be important for a variety of stochastic algorithms.

RJ


On Tue, Aug 26, 2014 at 12:54 PM, Xiangrui Meng men...@gmail.com wrote:

 miniBatchFraction uses RDD.sample to get the mini-batch, and sample
 still needs to visit the elements one after another. So it is not
 efficient if the task is not computation heavy and this is why
 setMiniBatchFraction is marked as experimental. If we can detect that
 the partition iterator is backed by an ArrayBuffer, maybe we can do a
 skip iterator to skip elements. -Xiangrui

 On Tue, Aug 26, 2014 at 8:15 AM, Ulanov, Alexander
 alexander.ula...@hp.com wrote:
  Hi, RJ
 
 
 https://github.com/avulanov/spark/blob/neuralnetwork/mllib/src/main/scala/org/apache/spark/mllib/classification/NeuralNetwork.scala
 
  Unit tests are in the same branch.
 
  Alexander
 
  From: RJ Nowling [mailto:rnowl...@gmail.com]
  Sent: Tuesday, August 26, 2014 6:59 PM
  To: Ulanov, Alexander
  Cc: dev@spark.apache.org
  Subject: Re: Gradient descent and runMiniBatchSGD
 
  Hi Alexander,
 
  Can you post a link to the code?
 
  RJ
 
  On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander 
 alexander.ula...@hp.com wrote:
  Hi,
 
  I've implemented back propagation algorithm using Gradient class and a
 simple update using Updater class. Then I run the algorithm with mllib's
 GradientDescent class. I have troubles in scaling out this implementation.
 I thought that if I partition my data into the number of workers then
 performance will increase, because each worker will run a step of gradient
 descent on its partition of data. But this does not happen and each worker
 seems to process all data (if miniBatchFraction == 1.0 as in mllib's
 logistic regression implementation). For me, this doesn't make sense,
 because then only single Worker will provide the same performance. Could
 someone elaborate on this and correct me if I am wrong. How can I scale out
 the algorithm with many Workers?
 
  Best regards, Alexander
 
 
 
  --
  em rnowl...@gmail.com
  c 954.496.2314




-- 
em rnowl...@gmail.com
c 954.496.2314
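
(As a rough sketch of the skip iterator Xiangrui describes above, assuming
0 < p <= 1 and a partition with cheap positional access, e.g. an ArrayBuffer:)

import scala.util.Random

// Bernoulli(p) sampling by drawing geometric gaps between kept elements,
// instead of flipping a coin for every element
def sampleIndices(n: Int, p: Double, rng: Random = new Random): Iterator[Int] =
  new Iterator[Int] {
    private def gap(): Int = {
      val u = rng.nextDouble()  // in [0, 1)
      if (u == 0.0) 0
      else math.floor(math.log(u) / math.log(1.0 - p)).toInt  // failures before next success
    }
    private var i = gap()
    def hasNext: Boolean = i < n
    def next(): Int = { val cur = i; i += gap() + 1; cur }
  }

// e.g. visit only ~10% of the indices of a large buffer:
// val sample = sampleIndices(buf.length, 0.1).map(buf(_)).toArray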


Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
Also, another idea: many algorithms that use sampling tend to do so multiple
times.  It may be beneficial to allow a transformation to a representation
that is more efficient for multiple rounds of sampling.


On Tue, Aug 26, 2014 at 4:36 PM, RJ Nowling rnowl...@gmail.com wrote:

 Xiangrui,

 I posted a note on my JIRA for MiniBatch KMeans about the same problem --
 sampling running in O(n).

 Can you elaborate on ways to get more efficient sampling?  I think this
 will be important for a variety of stochastic algorithms.

 RJ


 On Tue, Aug 26, 2014 at 12:54 PM, Xiangrui Meng men...@gmail.com wrote:

 miniBatchFraction uses RDD.sample to get the mini-batch, and sample
 still needs to visit the elements one after another. So it is not
 efficient if the task is not computation heavy and this is why
 setMiniBatchFraction is marked as experimental. If we can detect that
 the partition iterator is backed by an ArrayBuffer, maybe we can do a
 skip iterator to skip elements. -Xiangrui

 On Tue, Aug 26, 2014 at 8:15 AM, Ulanov, Alexander
 alexander.ula...@hp.com wrote:
  Hi, RJ
 
 
 https://github.com/avulanov/spark/blob/neuralnetwork/mllib/src/main/scala/org/apache/spark/mllib/classification/NeuralNetwork.scala
 
  Unit tests are in the same branch.
 
  Alexander
 
  From: RJ Nowling [mailto:rnowl...@gmail.com]
  Sent: Tuesday, August 26, 2014 6:59 PM
  To: Ulanov, Alexander
  Cc: dev@spark.apache.org
  Subject: Re: Gradient descent and runMiniBatchSGD
 
  Hi Alexander,
 
  Can you post a link to the code?
 
  RJ
 
  On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander 
 alexander.ula...@hp.com wrote:
  Hi,
 
  I've implemented back propagation algorithm using Gradient class and a
 simple update using Updater class. Then I run the algorithm with mllib's
 GradientDescent class. I have troubles in scaling out this implementation.
 I thought that if I partition my data into the number of workers then
 performance will increase, because each worker will run a step of gradient
 descent on its partition of data. But this does not happen and each worker
 seems to process all data (if miniBatchFraction == 1.0 as in mllib's
 logistic regression implementation). For me, this doesn't make sense,
 because then only single Worker will provide the same performance. Could
 someone elaborate on this and correct me if I am wrong. How can I scale out
 the algorithm with many Workers?
 
  Best regards, Alexander
 
 
 
  --
  em rnowl...@gmail.com
  c 954.496.2314




 --
 em rnowl...@gmail.com
 c 954.496.2314




-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread Ulanov, Alexander
Hi Xiangrui,

Thanks for the explanation, but I'm still missing something. In my experiments, if 
miniBatchFraction == 1.0, no matter how the data is partitioned (2, 4, 8, 16 
partitions), the algorithm executes in more or less the same time. (I have 16 
Workers.) Reduce from runMiniBatchSGD takes most of the time for 2 partitions, 
mapPartitionWithIndex -- for 16. What I would expect is that the time decreases 
in proportion to the number of data partitions, because each partition will 
hopefully be processed on a separate Worker. Why does the time not decrease?

Btw, processing one instance in my algorithm is a heavy computation, which is 
exactly why I want to parallelize it.

Best regards, Alexander

On 26.08.2014, at 20:54, Xiangrui Meng
men...@gmail.com wrote:

miniBatchFraction uses RDD.sample to get the mini-batch, and sample
still needs to visit the elements one after another. So it is not
efficient if the task is not computation heavy and this is why
setMiniBatchFraction is marked as experimental. If we can detect that
the partition iterator is backed by an ArrayBuffer, maybe we can do a
skip iterator to skip elements. -Xiangrui

On Tue, Aug 26, 2014 at 8:15 AM, Ulanov, Alexander
alexander.ula...@hp.com wrote:
Hi, RJ

https://github.com/avulanov/spark/blob/neuralnetwork/mllib/src/main/scala/org/apache/spark/mllib/classification/NeuralNetwork.scala

Unit tests are in the same branch.

Alexander

From: RJ Nowling [mailto:rnowl...@gmail.com]
Sent: Tuesday, August 26, 2014 6:59 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Gradient descent and runMiniBatchSGD

Hi Alexander,

Can you post a link to the code?

RJ

On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander 
alexander.ula...@hp.com
 wrote:
Hi,

I've implemented back propagation algorithm using Gradient class and a simple 
update using Updater class. Then I run the algorithm with mllib's 
GradientDescent class. I have troubles in scaling out this implementation. I 
thought that if I partition my data into the number of workers then performance 
will increase, because each worker will run a step of gradient descent on its 
partition of data. But this does not happen and each worker seems to process 
all data (if miniBatchFraction == 1.0 as in mllib's logistic regression 
implementation). For me, this doesn't make sense, because then only single 
Worker will provide the same performance. Could someone elaborate on this and 
correct me if I am wrong. How can I scale out the algorithm with many Workers?

Best regards, Alexander



--
em rnowl...@gmail.com
c 954.496.2314




Re: Handling stale PRs

2014-08-26 Thread Nicholas Chammas
By the way, as a reference point, I just stumbled across the Discourse
GitHub project and their list of pull requests
https://github.com/discourse/discourse/pulls looks pretty neat.

~2,200 closed PRs, 6 open. Least recently updated PR dates to 8 days ago.
Project started ~1.5 years ago.

Dunno how many committers Discourse has, but it looks like they've managed
their PRs well. I hope we can do as well in this regard as they have.

Nick


On Tue, Aug 26, 2014 at 2:40 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Sure; App Engine supports cron and sending emails.  We can configure the
 app with Spark QA’s credentials in order to allow it to post comments on
 issues, etc.

 - Josh

 On August 26, 2014 at 11:38:08 AM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:

  OK, that sounds pretty cool.

 Josh,

 Do you see this app as encompassing or supplanting the functionality I
 described as well?

 Nick


 On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:

  Last weekend, I started hacking on a Google App Engine app for helping
 with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).
  Some of my basic goals (not all implemented yet):

  - Users sign in using GitHub and can browse a list of pull requests,
 including links to associated JIRAs, Jenkins statuses, a quick preview of
 the last comment, etc.

  - Pull requests are auto-classified based on which components they
 modify (by looking at the diff).

  - From the app’s own internal database of PRs, we can build dashboards
 to find “abandoned” PRs, graph average time to first review, etc.

  - Since we authenticate users with GitHub, we can enable administrative
 functions via this dashboard (e.g. “assign this PR to me”, “vote to close
 in the weekly auto-close commit”, etc.

 Right now, I’ve implemented GitHub OAuth support and code to update the
 issues database using the GitHub API.  Because we have access to the full
 API, it’s pretty easy to do fancy things like parsing the reason for a
 Jenkins failure, etc.  You could even imagine some fancy mashup tools to
 pull up JIRAs and pull requests side-by-side in iframes.

 After I hack on this a bit more, I plan to release a public preview
 version; if we find this tool useful, I’ll clean it up and open-source the
 app so folks can contribute to it.

 - Josh

 On August 26, 2014 at 8:16:46 AM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:

  On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  I'd prefer if we took the approach of politely explaining why in the
  current form the patch isn't acceptable and closing it (potentially w/ tips
  on how to improve it or narrow the scope).


 Amen to this. Aiming for such a culture would set Spark apart from other
 projects in a great way.

 I've proposed several different solutions to ASF infra to streamline the
  process, but thus far they haven't been open to any of my ideas:


 I've added myself as a watcher on those 2 INFRA issues. Sucks that the only
 solution on offer right now requires basically polluting the commit
 history.

 Short of moving Spark's repo to a non-ASF-managed GitHub account, do you
 think another bot could help us manage the number of stale PRs?

 I'm thinking a solution as follows might be very helpful (the triage logic
 is sketched in code below):

 - Extend Spark QA / Jenkins to run on a weekly schedule and check for
   stale PRs. Let's say a stale PR is an open one that hasn't been updated
   in N months.
 - Spark QA maintains a list of known committers on its side.
 - During its weekly check of stale PRs, Spark QA takes the following
   action:
   - If the last person to comment on a PR was a committer, post to the
     PR asking for an update from the contributor.
   - If the last person to comment on a PR was a contributor, add the PR
     to a list. Email this list of *hanging PRs* out to the dev list on a
     weekly basis and ask committers to update them.
   - If the last person to comment on a PR was Spark QA asking the
     contributor to update it, then add the PR to a list. Email this list
     of *abandoned PRs* to the dev list for the record (or for closing, if
     that becomes possible in the future).

 This doesn't solve the problem of not being able to close PRs, but it does
 help make sure no PR is left hanging for long.
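
 To make those rules concrete, here is a rough sketch of the triage logic
 (the PullRequest type and its fields are hypothetical stand-ins, not a real
 GitHub API binding):

     import java.time.{Duration, Instant}

     // Hypothetical input record; the real bot would build this from the
     // GitHub API.
     case class PullRequest(number: Int, lastUpdated: Instant, lastCommenter: String)

     sealed trait Action
     case object PingContributor  extends Action // last word was a committer's
     case object EmailAsHanging   extends Action // contributor is waiting on us
     case object EmailAsAbandoned extends Action // contributor never answered the bot

     def triage(pr: PullRequest, committers: Set[String], botName: String,
                staleAfter: Duration, now: Instant): Option[Action] =
       if (Duration.between(pr.lastUpdated, now).compareTo(staleAfter) < 0) {
         None // not stale yet
       } else pr.lastCommenter match {
         case `botName`                   => Some(EmailAsAbandoned)
         case c if committers.contains(c) => Some(PingContributor)
         case _                           => Some(EmailAsHanging)
       }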

 What do you think? I'd be interested in implementing this solution if we
 like it.

 Nick





Re: CoHadoop Papers

2014-08-26 Thread Michael Armbrust
It seems like there are two things here:
 - Co-locating blocks with the same keys to avoid network transfer.
 - Leveraging partitioning information to avoid a shuffle when data is
already partitioned correctly (even if those partitions aren't yet on the
same machine).

The former seems more complicated and probably requires the support from
Hadoop you linked to.  However, the latter might be easier as there is
already a framework for reasoning about partitioning and the need to
shuffle in the Spark SQL planner.
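
To illustrate the second point at the RDD level (Spark SQL's planner does
this reasoning internally; the snippet below is only a sketch):

    import org.apache.spark.{HashPartitioner, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD functions

    // Both RDDs share the same partitioner, so join() can reuse the
    // existing partitioning (a narrow dependency) instead of shuffling,
    // even though nothing guarantees that matching partitions live on the
    // same machine.
    def copartitionedJoin(sc: SparkContext): Unit = {
      val part   = new HashPartitioner(8)
      val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(part).cache()
      val orders = sc.parallelize(Seq((1, 9.99), (2, 4.50))).partitionBy(part).cache()
      users.join(orders).collect().foreach(println)
    }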


On Tue, Aug 26, 2014 at 8:37 AM, Gary Malouf malouf.g...@gmail.com wrote:

 Christopher, can you expand on the co-partitioning support?

 We have a number of Spark SQL tables (saved in Parquet format) that all
 could be considered to have a common hash key.  Our analytics team wants to
 do frequent joins across these different data-sets based on this key.  It
 makes sense that if the data for each key across 'tables' was co-located on
 the same server, shuffles could be minimized and ultimately performance
 could be much better.

 From reading the HDFS issue I posted before, the way is being paved for
 implementing this type of behavior, though I believe there are a lot of
 complications to making it work.


 On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen c...@adatao.com
 wrote:

  Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?
 
  If the former, Spark does support copartitioning.
 
  If the latter, it's an HDFS scope that's outside of Spark. On that note,
  Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm
  sure the paper makes useful contributions for its set of use cases.
 
  Sent while mobile. Pls excuse typos etc.
  On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote:
 
  It appears support for this type of control over block placement is going
  out in the next version of HDFS:
  https://issues.apache.org/jira/browse/HDFS-2576
 
 
  On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com
  wrote:
 
   One of my colleagues has been questioning me as to why Spark/HDFS makes
   no attempt to co-locate related data blocks.  He pointed to this
   paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
   CoHadoop research and the performance improvements it yielded for
   Map/Reduce jobs.
  
   Would leveraging these ideas for writing data from Spark make sense/be
   worthwhile?
  
  
  
 
 



Re: CoHadoop Papers

2014-08-26 Thread Gary Malouf
Hi Michael,

I think once that work is in HDFS, it will be great to expose this
functionality via Spark.  This is worth pursuing because it could yield
order-of-magnitude performance improvements in cases where people need to
join data.

The second item would also be very interesting; it could yield significant
performance boosts.

Best,

Gary


On Tue, Aug 26, 2014 at 6:50 PM, Michael Armbrust mich...@databricks.com
wrote:

 It seems like there are two things here:
  - Co-locating blocks with the same keys to avoid network transfer.
  - Leveraging partitioning information to avoid a shuffle when data is
 already partitioned correctly (even if those partitions aren't yet on the
 same machine).

 The former seems more complicated and probably requires the support from
 Hadoop you linked to.  However, the latter might be easier as there is
 already a framework for reasoning about partitioning and the need to
 shuffle in the Spark SQL planner.


 On Tue, Aug 26, 2014 at 8:37 AM, Gary Malouf malouf.g...@gmail.com
 wrote:

 Christopher, can you expand on the co-partitioning support?

 We have a number of Spark SQL tables (saved in Parquet format) that all
 could be considered to have a common hash key.  Our analytics team wants to
 do frequent joins across these different data-sets based on this key.  It
 makes sense that if the data for each key across 'tables' was co-located on
 the same server, shuffles could be minimized and ultimately performance
 could be much better.

 From reading the HDFS issue I posted before, the way is being paved for
 implementing this type of behavior, though I believe there are a lot of
 complications to making it work.


 On Tue, Aug 26, 2014 at 10:40 AM, Christopher Nguyen c...@adatao.com
 wrote:

  Gary, do you mean Spark and HDFS separately, or Spark's use of HDFS?
 
  If the former, Spark does support copartitioning.
 
  If the latter, it's an HDFS scope that's outside of Spark. On that note,
   Hadoop does also make attempts to collocate data, e.g., rack awareness. I'm
   sure the paper makes useful contributions for its set of use cases.
 
  Sent while mobile. Pls excuse typos etc.
  On Aug 26, 2014 5:21 AM, Gary Malouf malouf.g...@gmail.com wrote:
 
   It appears support for this type of control over block placement is going
   out in the next version of HDFS:
  https://issues.apache.org/jira/browse/HDFS-2576
 
 
  On Tue, Aug 26, 2014 at 7:43 AM, Gary Malouf malouf.g...@gmail.com
  wrote:
 
    One of my colleagues has been questioning me as to why Spark/HDFS makes
    no attempt to co-locate related data blocks.  He pointed to this
    paper: http://www.vldb.org/pvldb/vol4/p575-eltabakh.pdf from 2011 on the
    CoHadoop research and the performance improvements it yielded for
    Map/Reduce jobs.
  
    Would leveraging these ideas for writing data from Spark make sense/be
    worthwhile?
  
  
  
 
 





OutOfMemoryError when running sbt/sbt test

2014-08-26 Thread jay vyas
Hi spark.

I've been trying to build Spark, but I've been getting lots of OOME
exceptions.

https://gist.github.com/jayunit100/d424b6b825ce8517d68c

For the most part, they are of the form:

java.lang.OutOfMemoryError: unable to create new native thread

I've attempted to hard-code the get_mem_opts function in the
sbt-launch-lib.bash file to use various very high parameter sizes
(e.g. -Xms5g) with a high MaxPermSize, etc., but to no avail.

Any thoughts on this would be appreciated.

I know of others having the same problem as well.

Thanks!

-- 
jay vyas


Re: OutOfMemoryError when running sbt/sbt test

2014-08-26 Thread Mubarak Seyed
What is your ulimit value?
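
(That particular error usually means the per-user process limit was hit,
not the heap. Assuming Linux, you can check and raise it like this:)

    # Threads count against the max-user-processes limit:
    ulimit -u
    # If it is low (e.g. 1024), raise it for the current shell before building:
    ulimit -u 4096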


On Tue, Aug 26, 2014 at 5:49 PM, jay vyas jayunit100.apa...@gmail.com
wrote:

 Hi spark.

 I've been trying to build Spark, but I've been getting lots of OOME
 exceptions.

 https://gist.github.com/jayunit100/d424b6b825ce8517d68c

 For the most part, they are of the form:

 java.lang.OutOfMemoryError: unable to create new native thread

 I've attempted to hard-code the get_mem_opts function in the
 sbt-launch-lib.bash file to use various very high parameter sizes
 (e.g. -Xms5g) with a high MaxPermSize, etc., but to no avail.

 Any thoughts on this would be appreciated.

 I know of others having the same problem as well.

 Thanks!

 --
 jay vyas



Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Matei Zaharia
This shouldn't be a chicken-and-egg problem, since the script fetches the AMI 
from a known URL. Seems like an issue in publishing this release.

On August 26, 2014 at 1:24:45 PM, Shivaram Venkataraman 
(shiva...@eecs.berkeley.edu) wrote:

This is a chicken-and-egg problem in some sense. We can't change the ec2
script till we have made the release and uploaded the binaries -- but once
that is done, the script shipped in the release can no longer be updated.

I think the model we support so far is that you can launch the latest
Spark version from the master branch on github. I guess we can try to add
something in the release process that updates the script but doesn't commit
it? The release managers might be able to add more.

Thanks  
Shivaram  


On Tue, Aug 26, 2014 at 1:16 PM, Nicholas Chammas   
nicholas.cham...@gmail.com wrote:  

 I downloaded the source code release for 1.0.2 from here  
 http://spark.apache.org/downloads.html and launched an EC2 cluster using  
 spark-ec2.  
  
 After the cluster finishes launching, I fire up the shell and check the  
 version:  
  
 scala> sc.version
 res1: String = 1.0.1
  
 The startup banner also shows the same thing. Hmm...  
  
 So I dig around and find that the spark_ec2.py script has the default Spark  
 version set to 1.0.1.  
  
 Derp.  
  
 parser.add_option("-v", "--spark-version", default="1.0.1",
     help="Version of Spark to use: 'X.Y.Z' or a specific git hash")
  
 Is there any way to fix the release? It’s a minor issue, but could be very  
 confusing. And how can we prevent this from happening again?  
  
 Nick
  


Re: OutOfMemoryError when running sbt/sbt test

2014-08-26 Thread Jay Vyas
Thanks...! Some questions below.

1) You are suggesting that maybe this OOME is a symptom/red herring, and the
true cause is that a thread can't spawn because of ulimit... If so,
possibly this could be flagged early on in the build.  And -- where are so many
threads coming from that I need to up my limit?  Is this a new feature added
to Spark recently, and if so, will it affect deployment scenarios as well?

And 

2) Possibly SBT_OPTS is where the memory settings should go? If so, then why
do we have the get_mem_opts wrapper function coded to pass memory settings
manually as -Xmx/-Xms options?
  execRunner "$java_cmd" \
    ${SBT_OPTS:-$default_sbt_opts} \
    $(get_mem_opts $sbt_mem) \
    ${java_opts} \
    ${java_args[@]} \
    -jar "$sbt_jar" \
    "${sbt_commands[@]}" \
    "${residual_args[@]}"



 On Aug 26, 2014, at 8:58 PM, Mubarak Seyed spark.devu...@gmail.com wrote:
 
 What is your ulimit value?
 
 
 On Tue, Aug 26, 2014 at 5:49 PM, jay vyas jayunit100.apa...@gmail.com 
 wrote:
 Hi spark.
 
 I've been trying to build Spark, but I've been getting lots of OOME
 exceptions.
 
 https://gist.github.com/jayunit100/d424b6b825ce8517d68c
 
 For the most part, they are of the form:
 
 java.lang.OutOfMemoryError: unable to create new native thread
 
 I've attempted to hard-code the get_mem_opts function in the
 sbt-launch-lib.bash file to use various very high parameter sizes
 (e.g. -Xms5g) with a high MaxPermSize, etc., but to no avail.
 
 Any thoughts on this would be appreciated.
 
 I know of others having the same problem as well.
 
 Thanks!
 
 --
 jay vyas
 


Re: OutOfMemoryError when running sbt/sbt test

2014-08-26 Thread Anand Avati
Hi Jay,
The recommended way to build Spark from source is with Maven.
You would want to follow the steps in
https://spark.apache.org/docs/latest/building-with-maven.html to set
MAVEN_OPTS to prevent OOM build errors.
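
At the time of writing, that page suggests settings along these lines (the
exact values may differ between releases):

    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
    mvn -DskipTests clean package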

Thanks


On Tue, Aug 26, 2014 at 5:49 PM, jay vyas jayunit100.apa...@gmail.com
wrote:

 Hi spark.

 I've been trying to build Spark, but I've been getting lots of OOME
 exceptions.

 https://gist.github.com/jayunit100/d424b6b825ce8517d68c

 For the most part, they are of the form:

 java.lang.OutOfMemoryError: unable to create new native thread

 I've attempted to hard-code the get_mem_opts function in the
 sbt-launch-lib.bash file to use various very high parameter sizes
 (e.g. -Xms5g) with a high MaxPermSize, etc., but to no avail.

 Any thoughts on this would be appreciated.

 I know of others having the same problem as well.

 Thanks!

 --
 jay vyas



Re: Handling stale PRs

2014-08-26 Thread Madhu
Nicholas Chammas wrote
 Dunno how many committers Discourse has, but it looks like they've managed
 their PRs well. I hope we can do as well in this regard as they have.

Discourse developers appear to eat their own dog food
(https://meta.discourse.org).
Improved collaboration and a shared vision might be a reason for their
success.




--
Madhu
https://www.linkedin.com/in/msiddalingaiah




Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Tathagata Das
Yes, this was an oversight on my part. I have opened a JIRA for this.
https://issues.apache.org/jira/browse/SPARK-3242

For the time being, the workaround is to pass the version (1.0.2)
explicitly when invoking the script.
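
For example, from the ec2/ directory (key pair, identity file, and cluster
name below are placeholders):

    ./spark-ec2 -k my-keypair -i my-keypair.pem -s 2 \
      --spark-version=1.0.2 launch my-cluster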

TD


On Tue, Aug 26, 2014 at 6:39 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 This shouldn't be a chicken-and-egg problem, since the script fetches the
 AMI from a known URL. Seems like an issue in publishing this release.

 On August 26, 2014 at 1:24:45 PM, Shivaram Venkataraman (
 shiva...@eecs.berkeley.edu) wrote:

 This is a chicken-and-egg problem in some sense. We can't change the ec2
 script till we have made the release and uploaded the binaries -- but once
 that is done, the script shipped in the release can no longer be updated.

 I think the model we support so far is that you can launch the latest
 Spark version from the master branch on github. I guess we can try to add
 something in the release process that updates the script but doesn't commit
 it? The release managers might be able to add more.

 Thanks
 Shivaram


 On Tue, Aug 26, 2014 at 1:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  I downloaded the source code release for 1.0.2 from here
  http://spark.apache.org/downloads.html and launched an EC2 cluster
 using
  spark-ec2.
 
  After the cluster finishes launching, I fire up the shell and check the
  version:
 
  scala> sc.version
  res1: String = 1.0.1
 
  The startup banner also shows the same thing. Hmm...
 
  So I dig around and find that the spark_ec2.py script has the default
 Spark
  version set to 1.0.1.
 
  Derp.
 
  parser.add_option("-v", "--spark-version", default="1.0.1",
      help="Version of Spark to use: 'X.Y.Z' or a specific git hash")
 
  Is there any way to fix the release? It’s a minor issue, but could be
 very
  confusing. And how can we prevent this from happening again?
 
  Nick