[jira] [Updated] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB

2015-08-09 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6101:

Fix Version/s: (was: 1.5.0)
   1.6.0
  Description: 
similar to https://github.com/databricks/spark-avro  and 
https://github.com/databricks/spark-csv

Here's a good basis for a high-level, Java-based DynamoDB connector:  
https://github.com/sporcina/dynamodb-connector/

  was:similar to https://github.com/databricks/spark-avro  and 
https://github.com/databricks/spark-csv


 Create a SparkSQL DataSource API implementation for DynamoDB
 

 Key: SPARK-6101
 URL: https://issues.apache.org/jira/browse/SPARK-6101
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Chris Fregly
Assignee: Chris Fregly
 Fix For: 1.6.0


 similar to https://github.com/databricks/spark-avro  and 
 https://github.com/databricks/spark-csv
 Here's a good basis for a high-level, Java-based DynamoDB connector:  
 https://github.com/sporcina/dynamodb-connector/
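
A DynamoDB data source would presumably be used the same way as spark-avro and 
spark-csv. A minimal usage sketch - the format name and the table/region 
options below are assumptions, since no such package exists yet:

{code}
# hypothetical format name and options; mirrors the spark-csv usage pattern
df = sqlContext.read \
    .format("com.example.spark.dynamodb") \
    .option("table", "my_table") \
    .option("region", "us-east-1") \
    .load()
df.registerTempTable("my_table")
{code}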






[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2015-07-14 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627045#comment-14627045
 ] 

Chris Fregly commented on SPARK-4144:
-

@[~freeman-lab]:

Looks like this is still open.  Any chance I can take it back?

Did you make any progress that you'd like to share?

Let me know.  I'd love to help here.

 Support incremental model training of Naive Bayes classifier
 

 Key: SPARK-4144
 URL: https://issues.apache.org/jira/browse/SPARK-4144
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Chris Fregly
Assignee: Jeremy Freeman

 Per Xiangrui Meng from the following user list discussion:  
 http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E

 For Naive Bayes, we need to update the priors and conditional
 probabilities, which means we should also remember the number of
 observations for the updates.
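
A rough sketch of the bookkeeping described above (plain Python, hypothetical 
names - not the MLlib API): keep running counts so the priors and conditional 
probabilities can be recomputed as new observations arrive.

{code}
from collections import defaultdict

class IncrementalNB(object):
    def __init__(self):
        self.class_counts = defaultdict(float)    # count per class c
        self.feature_counts = defaultdict(float)  # count per (class, feature)
        self.total = 0.0                          # total observations seen

    def update(self, label, features):
        # fold one new observation into the running counts
        self.total += 1
        self.class_counts[label] += 1
        for j, x in enumerate(features):
            self.feature_counts[(label, j)] += x

    def prior(self, label):
        # priors fall out of the counts at any point in the stream
        return self.class_counts[label] / self.total
{code}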






[jira] [Created] (SPARK-8550) table() no longer supports specifying the database - i.e. table([database].[table]).

2015-06-22 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-8550:
---

 Summary: table() no longer supports specifying the database - i.e. 
table([database].[table]).
 Key: SPARK-8550
 URL: https://issues.apache.org/jira/browse/SPARK-8550
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Chris Fregly


This is a regression from 1.3.
The workaround for now is to use sql("SELECT * FROM [database].[table]").
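
For example (database/table names are placeholders):

{code}
# fails on 1.4 (regression); worked on 1.3:
df = sqlContext.table("my_database.my_table")

# workaround until this is fixed:
df = sqlContext.sql("SELECT * FROM my_database.my_table")
{code}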






[jira] [Updated] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB

2015-05-28 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6101:

Description: similar to https://github.com/databricks/spark-avro  and 
https://github.com/databricks/spark-csv  (was: similar to 
https://github.com/databricks/spark-avro)

 Create a SparkSQL DataSource API implementation for DynamoDB
 

 Key: SPARK-6101
 URL: https://issues.apache.org/jira/browse/SPARK-6101
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Chris Fregly
Assignee: Chris Fregly
 Fix For: 1.5.0


 similar to https://github.com/databricks/spark-avro  and 
 https://github.com/databricks/spark-csv






[jira] [Updated] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library

2015-05-17 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6654:

Target Version/s: 1.4.0  (was: 1.5.0)

 Update Kinesis Streaming impls (both KCL-based and Direct) to use latest 
 aws-java-sdk and kinesis-client-library
 

 Key: SPARK-6654
 URL: https://issues.apache.org/jira/browse/SPARK-6654
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly








[jira] [Resolved] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library

2015-05-17 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly resolved SPARK-6654.
-
   Resolution: Duplicate
Fix Version/s: 1.4.0

duplicate of SPARK-7679

 Update Kinesis Streaming impls (both KCL-based and Direct) to use latest 
 aws-java-sdk and kinesis-client-library
 

 Key: SPARK-6654
 URL: https://issues.apache.org/jira/browse/SPARK-6654
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly
 Fix For: 1.4.0









[jira] [Updated] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB

2015-05-09 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6101:

Fix Version/s: (was: 1.4.0)
   1.5.0

 Create a SparkSQL DataSource API implementation for DynamoDB
 

 Key: SPARK-6101
 URL: https://issues.apache.org/jira/browse/SPARK-6101
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Chris Fregly
Assignee: Chris Fregly
 Fix For: 1.5.0


 similar to https://github.com/databricks/spark-avro






[jira] [Closed] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions

2015-05-03 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly closed SPARK-4184.
---
Resolution: Duplicate

We'll incorporate the changes incrementally.

 Improve Spark Streaming documentation to address commonly-asked questions 
 --

 Key: SPARK-4184
 URL: https://issues.apache.org/jira/browse/SPARK-4184
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Reporter: Chris Fregly
  Labels: documentation, streaming

 Improve Streaming documentation including API descriptions, 
 concurrency/thread safety, fault tolerance, replication, checkpointing, 
 scalability, resource allocation and utilization, back pressure, and 
 monitoring.
 Also, add a section to the Kinesis streaming guide describing how to use IAM 
 roles with the Spark Kinesis Receiver.






[jira] [Updated] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library

2015-05-03 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6654:

Priority: Major  (was: Blocker)
Target Version/s: 1.5.0  (was: 1.4.0)

 Update Kinesis Streaming impls (both KCL-based and Direct) to use latest 
 aws-java-sdk and kinesis-client-library
 

 Key: SPARK-6654
 URL: https://issues.apache.org/jira/browse/SPARK-6654
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly








[jira] [Commented] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-30 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522351#comment-14522351
 ] 

Chris Fregly commented on SPARK-7178:
-

fillna() is also commonly used:

https://forums.databricks.com/questions/790/how-do-i-replace-nulls-with-0s-in-a-dataframe.html
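
For example, the two common forms (column names are placeholders):

{code}
df.fillna(0)                              # replace nulls with 0 everywhere
df.fillna({'age': 0, 'name': 'unknown'})  # per-column replacement values
{code}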

 Improve DataFrame documentation and code samples
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.
 Also, working with StructTypes is a bit confusing.  The following link:  
 https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
  (Python tab) implies that you can work with tuples directly when creating a 
 DataFrame.
 However, the following code errors out unless we explicitly use Row objects:
 {code}
 from pyspark.sql import Row
 from pyspark.sql.types import *
 # The schema is encoded in a string.
 schemaString = "a"
 fields = [StructField(field_name, MapType(StringType(), IntegerType()))
           for field_name in schemaString.split()]
 schema = StructType(fields)
 df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
 {code}






[jira] [Comment Edited] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-28 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517858#comment-14517858
 ] 

Chris Fregly edited comment on SPARK-7178 at 4/28/15 8:07 PM:
--

Added these to the forums:

AND and OR:  
https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html

Nested Map Columns in DataFrames:
https://forums.databricks.com/questions/764/how-do-i-create-a-dataframe-with-nested-map-column.html

Casting columns of DataFrames:
https://forums.databricks.com/questions/767/how-do-i-cast-within-a-dataframe.html


was (Author: cfregly):
Added this to the forums to address the AND and OR question:  
https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html

 Improve DataFrame documentation and code samples
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.
 Also, working with StructTypes is a bit confusing.  The following link:  
 https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
  (Python tab) implies that you can work with tuples directly when creating a 
 DataFrame.
 However, the following code errors out unless we explicitly use Row objects:
 {code}
 from pyspark.sql import Row
 from pyspark.sql.types import *
 # The schema is encoded in a string.
 schemaString = "a"
 fields = [StructField(field_name, MapType(StringType(), IntegerType()))
           for field_name in schemaString.split()]
 schema = StructType(fields)
 df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
 {code}






[jira] [Commented] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-28 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517858#comment-14517858
 ] 

Chris Fregly commented on SPARK-7178:
-

Added this to the forums to address the AND and OR question:  
https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html

 Improve DataFrame documentation and code samples
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.
 Also, working with StructTypes is a bit confusing.  The following link:  
 https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
  (Python tab) implies that you can work with tuples directly when creating a 
 DataFrame.
 However, the following code errors out unless we explicitly use Row objects:
 {code}
 from pyspark.sql import Row
 from pyspark.sql.types import *
 # The schema is encoded in a string.
 schemaString = "a"
 fields = [StructField(field_name, MapType(StringType(), IntegerType()))
           for field_name in schemaString.split()]
 schema = StructType(fields)
 df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
 {code}






[jira] [Created] (SPARK-7178) Improve DataFrame documentation to include common uses like AND and OR semantics within filters, etc

2015-04-27 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-7178:
---

 Summary: Improve DataFrame documentation to include common uses 
like AND and OR semantics within filters, etc
 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly


AND and OR are not straightforward when using the new DataFrame API.

The current convention - accepted by Pandas users - is to use the bitwise & and 
| instead of AND and OR.  When using these, however, you need to wrap each 
expression in parentheses to prevent the bitwise operator from taking precedence.
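
For example (column names are placeholders):

{code}
# & and | stand in for AND and OR; each comparison must be parenthesized
# because the bitwise operator binds more tightly than the comparison:
df.filter((df.age > 21) & (df.city == 'SF'))    # AND
df.filter((df.age > 21) | (df.city == 'SF'))    # OR
# df.filter(df.age > 21 & df.city == 'SF')      # broken: 21 & df.city first
{code}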






[jira] [Commented] (SPARK-7178) Improve DataFrame documentation to include common uses like AND and OR semantics within filters, etc

2015-04-27 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516021#comment-14516021
 ] 

Chris Fregly commented on SPARK-7178:
-

I recommend updating both the Scala docs and the SQL Programming Guide.

 Improve DataFrame documentation to include common uses like AND and OR 
 semantics within filters, etc
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.






[jira] [Commented] (SPARK-7178) Improve DataFrame documentation to include common uses like AND and OR semantics within filters, etc

2015-04-27 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516024#comment-14516024
 ] 

Chris Fregly commented on SPARK-7178:
-

cc'ing [~rxin]

 Improve DataFrame documentation to include common uses like AND and OR 
 semantics within filters, etc
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.






[jira] [Commented] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-27 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516095#comment-14516095
 ] 

Chris Fregly commented on SPARK-7178:
-


{code}
from pyspark.sql import Row
from pyspark.sql.types import *

# The schema is encoded in a string.
schemaString = "a"

fields = [StructField(field_name, MapType(StringType(), IntegerType()))
          for field_name in schemaString.split()]
schema = StructType(fields)

df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
{code}

is equivalent to the following (without an explicit schema):

{code}
df2 = sqlContext.createDataFrame([{'a':{'b': 1}}])
{code}

But this isn't clear in the docs, as far as I can tell.

 Improve DataFrame documentation and code samples
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.
 Also, working with StructTypes is a bit confusing.  The following link:  
 https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
  (Python tab) implies that you can work with tuples directly when creating a 
 DataFrame.
 However, the following code errors out unless we explicitly use Row objects:
 {code}
 from pyspark.sql import Row
 from pyspark.sql.types import *
 # The schema is encoded in a string.
 schemaString = "a"
 fields = [StructField(field_name, MapType(StringType(), IntegerType()))
           for field_name in schemaString.split()]
 schema = StructType(fields)
 df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
 {code}






[jira] [Updated] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-27 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-7178:

Description: 
AND and OR are not straightforward when using the new DataFrame API.

The current convention - accepted by Pandas users - is to use the bitwise & and 
| instead of AND and OR.  When using these, however, you need to wrap each 
expression in parentheses to prevent the bitwise operator from taking precedence.

Also, working with StructTypes is a bit confusing.  The following link:  
https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
 (Python tab) implies that you can work with tuples directly when creating a 
DataFrame.

However, the following code errors out unless we explicitly use Row objects:

{code}
from pyspark.sql import Row
from pyspark.sql.types import *

# The schema is encoded in a string.
schemaString = "a"

fields = [StructField(field_name, MapType(StringType(), IntegerType()))
          for field_name in schemaString.split()]
schema = StructType(fields)

df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
{code}


  was:
AND and OR are not straightforward when using the new DataFrame API.

The current convention - accepted by Pandas users - is to use the bitwise & and 
| instead of AND and OR.  When using these, however, you need to wrap each 
expression in parentheses to prevent the bitwise operator from taking precedence.

Also, working with StructTypes is a bit confusing.  The following link:  
https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
 (Python tab) implies that you can work with tuples directly when creating a 
DataFrame.

However, the following code errors out unless we explicitly use Row objects:

{code}
fields = [StructField(field_name, MapType(StringType(), IntegerType()))
          for field_name in schemaString.split()]
schema = StructType(fields)
df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
{code}



 Improve DataFrame documentation and code samples
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.
 Also, working with StructTypes is a bit confusing.  The following link:  
 https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
  (Python tab) implies that you can work with tuples directly when creating a 
 DataFrame.
 However, the following code errors out unless we explicitly use Row objects:
 {code}
 from pyspark.sql import Row
 from pyspark.sql.types import *
 # The schema is encoded in a string.
 schemaString = "a"
 fields = [StructField(field_name, MapType(StringType(), IntegerType()))
           for field_name in schemaString.split()]
 schema = StructType(fields)
 df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
 {code}






[jira] [Updated] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-27 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-7178:

Summary: Improve DataFrame documentation and code samples  (was: Improve 
DataFrame documentation to include common uses like AND and OR semantics within 
filters, etc)

 Improve DataFrame documentation and code samples
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.






[jira] [Updated] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-27 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-7178:

Description: 
AND and OR are not straightforward when using the new DataFrame API.

The current convention - accepted by Pandas users - is to use the bitwise & and 
| instead of AND and OR.  When using these, however, you need to wrap each 
expression in parentheses to prevent the bitwise operator from taking precedence.

also, it's a bit confusing when creating a 

  was:
AND and OR are not straightforward when using the new DataFrame API.

The current convention - accepted by Pandas users - is to use the bitwise & and 
| instead of AND and OR.  When using these, however, you need to wrap each 
expression in parentheses to prevent the bitwise operator from taking precedence.


 Improve DataFrame documentation and code samples
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.
 also, it's a bit confusing when creating a 






[jira] [Commented] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-27 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516070#comment-14516070
 ] 

Chris Fregly commented on SPARK-7178:
-

cc'ing [~joshrosen]

Just talked to him about this as well.

 Improve DataFrame documentation and code samples
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.
 Also, working with StructTypes is a bit confusing.  The following link:  
 https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
  (Python tab) implies that you can work with tuples directly when creating a 
 DataFrame.
 However, the following code errors out unless we explicitly use Row objects:
 {code}
 fields = [StructField(field_name, MapType(StringType(), IntegerType()))
           for field_name in schemaString.split()]
 schema = StructType(fields)
 df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
 {code}






[jira] [Comment Edited] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-27 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516021#comment-14516021
 ] 

Chris Fregly edited comment on SPARK-7178 at 4/28/15 12:46 AM:
---

I recommend updating all of the following:
1)  Scala/Python/PySpark docs 
(i.e. 
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame)
2)  SQL Programming Guide 
(i.e. https://spark.apache.org/docs/latest/sql-programming-guide.html)



was (Author: cfregly):
I recommend updating both the Scala docs and the SQL Programming Guide.

 Improve DataFrame documentation and code samples
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.
 also, it's a bit confusing when creating a 






[jira] [Updated] (SPARK-7178) Improve DataFrame documentation and code samples

2015-04-27 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-7178:

Description: 
AND and OR are not straightforward when using the new DataFrame API.

The current convention - accepted by Pandas users - is to use the bitwise & and 
| instead of AND and OR.  When using these, however, you need to wrap each 
expression in parentheses to prevent the bitwise operator from taking precedence.

Also, working with StructTypes is a bit confusing.  The following link:  
https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
 (Python tab) implies that you can work with tuples directly when creating a 
DataFrame.

However, the following code errors out unless we explicitly use Row objects:

{code}
fields = [StructField(field_name, MapType(StringType(), IntegerType()))
          for field_name in schemaString.split()]
schema = StructType(fields)
df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
{code}


  was:
AND and OR are not straightforward when using the new DataFrame API.

The current convention - accepted by Pandas users - is to use the bitwise & and 
| instead of AND and OR.  When using these, however, you need to wrap each 
expression in parentheses to prevent the bitwise operator from taking precedence.

also, it's a bit confusing when creating a 


 Improve DataFrame documentation and code samples
 

 Key: SPARK-7178
 URL: https://issues.apache.org/jira/browse/SPARK-7178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
  Labels: dataframe

 AND and OR are not straightforward when using the new DataFrame API.
 The current convention - accepted by Pandas users - is to use the bitwise & 
 and | instead of AND and OR.  When using these, however, you need to wrap 
 each expression in parentheses to prevent the bitwise operator from taking 
 precedence.
 Also, working with StructTypes is a bit confusing.  The following link:  
 https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
  (Python tab) implies that you can work with tuples directly when creating a 
 DataFrame.
 However, the following code errors out unless we explicitly use Row objects:
 {code}
 fields = [StructField(field_name, MapType(StringType(), IntegerType()))
           for field_name in schemaString.split()]
 schema = StructType(fields)
 df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
 {code}






[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-04-14 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495613#comment-14495613
 ] 

Chris Fregly commented on SPARK-6514:
-

[~tdas] can you take a look at this PR when you get a chance?

 For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
 the Kinesis stream itself  
 

 Key: SPARK-6514
 URL: https://issues.apache.org/jira/browse/SPARK-6514
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Chris Fregly

 Context: I started the original Kinesis impl on KCL 1.0 (which did not support 
 this), then finished on KCL 1.1 without realizing that support had been added.
 Also, we should upgrade to the latest Kinesis Client Library (KCL), which I 
 believe is currently v1.2.






[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-04-14 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495607#comment-14495607
 ] 

Chris Fregly commented on SPARK-6514:
-

https://github.com/apache/spark/pull/5375

 For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
 the Kinesis stream itself  
 

 Key: SPARK-6514
 URL: https://issues.apache.org/jira/browse/SPARK-6514
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Chris Fregly

 Context: I started the original Kinesis impl on KCL 1.0 (which did not support 
 this), then finished on KCL 1.1 without realizing that support had been added.
 Also, we should upgrade to the latest Kinesis Client Library (KCL), which I 
 believe is currently v1.2.






[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()

2015-04-14 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495599#comment-14495599
 ] 

Chris Fregly commented on SPARK-5960:
-

[~tdas] can you take a look at this PR real quick?

 Allow AWS credentials to be passed to KinesisUtils.createStream()
 -

 Key: SPARK-5960
 URL: https://issues.apache.org/jira/browse/SPARK-5960
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly
Assignee: Chris Fregly

 While IAM roles are preferable, we're seeing a lot of cases where we need to 
 pass AWS credentials when creating the KinesisReceiver.
 Notes:
 * Make sure we don't log the credentials anywhere
 * Maintain compatibility with existing KinesisReceiver-based code.
  






[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-04-14 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495605#comment-14495605
 ] 

Chris Fregly commented on SPARK-6514:
-

Hey Pawel!

I think keeping it regionName is fine.  I just reviewed the KCL code and docs 
again - and realized that they also use this regionName for CloudWatch.

And by exposing the implementation, I meant that calling it something like 
Dynamo Region would be awkward if AWS ever changes the implementation to 
something other than Dynamo.  That's the level of implementation I was 
referring to.

Also, if we ever move off of KCL - or no longer need to set this region for 
whatever reason - it may bite us later.  Just something to think about.

Lastly, I think it's OK not to extract the region from the stream URL - 
especially if we make regionName a required field on the API.  If it's 
optional, however, I would use the one from the stream location.  This may be 
overly complicated.  Your call.
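
A PySpark sketch of the shape under discussion - argument names and module 
paths are assumptions, not the final API:

{code}
from pyspark.storagelevel import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

# a single required regionName covers the stream itself and the KCL's
# DynamoDB checkpoint table (and CloudWatch metrics)
stream = KinesisUtils.createStream(
    ssc, "my-app", "my-stream",
    "https://kinesis.us-west-2.amazonaws.com",  # endpointUrl
    "us-west-2",                                # regionName (required)
    InitialPositionInStream.LATEST,
    2,                                          # checkpoint interval (secs)
    StorageLevel.MEMORY_AND_DISK_2)
{code}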



 For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
 the Kinesis stream itself  
 

 Key: SPARK-6514
 URL: https://issues.apache.org/jira/browse/SPARK-6514
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Chris Fregly

 Context: I started the original Kinesis impl on KCL 1.0 (which did not support 
 this), then finished on KCL 1.1 without realizing that support had been added.
 Also, we should upgrade to the latest Kinesis Client Library (KCL), which I 
 believe is currently v1.2.






[jira] [Updated] (SPARK-6599) Improve reliability and usability of Kinesis-based Spark Streaming

2015-04-06 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6599:

Summary: Improve reliability and usability of Kinesis-based Spark Streaming 
 (was: Add Kinesis Direct API)

 Improve reliability and usability of Kinesis-based Spark Streaming
 --

 Key: SPARK-6599
 URL: https://issues.apache.org/jira/browse/SPARK-6599
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das








[jira] [Updated] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-04-06 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6514:

Target Version/s: 1.4.0  (was: 1.3.1)

 For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
 the Kinesis stream itself  
 

 Key: SPARK-6514
 URL: https://issues.apache.org/jira/browse/SPARK-6514
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Chris Fregly

 This was not supported when I originally wrote this receiver; it is now.
 Also, upgrade to the latest Kinesis Client Library (KCL), which I believe is 
 1.2.






[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-02 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392646#comment-14392646
 ] 

Chris Fregly commented on SPARK-6407:
-

From [~mengxr]:

The online update should be implemented with GraphX or indexedrdd, 
which may take some time. There is no open-source solution.

Try doing a survey on existing algorithms for online matrix 
factorization updates.

 Streaming ALS for Collaborative Filtering
 -

 Key: SPARK-6407
 URL: https://issues.apache.org/jira/browse/SPARK-6407
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Felix Cheung
Priority: Minor

 Like MLlib's ALS implementation for recommendation, but applied to streaming.
 Similar to streaming linear regression and logistic regression, could we apply 
 gradient updates to batches of data and reuse the existing MLlib implementation?
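
For reference, the streaming linear regression pattern mentioned above looks 
roughly like this in PySpark (a sketch only: trainingStream/testStream are 
assumed DStreams of LabeledPoints, and the class ships in 
pyspark.mllib.regression in later releases):

{code}
from pyspark.mllib.regression import StreamingLinearRegressionWithSGD

model = StreamingLinearRegressionWithSGD(stepSize=0.1, numIterations=50)
model.setInitialWeights([0.0, 0.0, 0.0])
model.trainOn(trainingStream)           # gradient update on each batch
model.predictOn(testStream).pprint()    # predictions as batches arrive
{code}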






[jira] [Updated] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()

2015-04-01 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-5960:

Target Version/s: 1.4.0  (was: 1.3.1)

 Allow AWS credentials to be passed to KinesisUtils.createStream()
 -

 Key: SPARK-5960
 URL: https://issues.apache.org/jira/browse/SPARK-5960
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly
Assignee: Chris Fregly

 While IAM roles are preferable, we're seeing a lot of cases where we need to 
 pass AWS credentials when creating the KinesisReceiver.
 Notes:
 * Make sure we don't log the credentials anywhere
 * Maintain compatibility with existing KinesisReceiver-based code.
  






[jira] [Created] (SPARK-6654) Update Kinesis Streaming impls (both KCL-based and Direct) to use latest aws-java-sdk and kinesis-client-library

2015-04-01 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-6654:
---

 Summary: Update Kinesis Streaming impls (both KCL-based and 
Direct) to use latest aws-java-sdk and kinesis-client-library
 Key: SPARK-6654
 URL: https://issues.apache.org/jira/browse/SPARK-6654
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly









[jira] [Created] (SPARK-6656) Allow the application name to be passed in versus pulling from SparkContext.getAppName()

2015-04-01 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-6656:
---

 Summary: Allow the application name to be passed in versus pulling 
from SparkContext.getAppName() 
 Key: SPARK-6656
 URL: https://issues.apache.org/jira/browse/SPARK-6656
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly


This is useful for the scenario where Kinesis Spark Streaming is invoked from 
the Spark shell.  In this case, the application name in the SparkContext is 
pre-set to "Spark shell".

This isn't a common or recommended use case, but it's best to make this 
configurable outside of SparkContext.
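
A sketch of the proposed shape (argument names assumed): the Kinesis 
application name becomes an explicit parameter instead of being read from 
SparkContext.getAppName():

{code}
stream = KinesisUtils.createStream(
    ssc,
    "my-kinesis-app",   # explicit app name, not SparkContext.getAppName()
    "my-stream", endpointUrl, regionName,
    InitialPositionInStream.LATEST, 2, StorageLevel.MEMORY_AND_DISK_2)
{code}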






[jira] [Updated] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions

2015-04-01 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-4184:

Target Version/s: 1.4.0  (was: 1.3.1)

 Improve Spark Streaming documentation to address commonly-asked questions 
 --

 Key: SPARK-4184
 URL: https://issues.apache.org/jira/browse/SPARK-4184
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Reporter: Chris Fregly
  Labels: documentation, streaming

 Improve Streaming documentation including API descriptions, 
 concurrency/thread safety, fault tolerance, replication, checkpointing, 
 scalability, resource allocation and utilization, back pressure, and 
 monitoring.
 Also, add a section to the Kinesis streaming guide describing how to use IAM 
 roles with the Spark Kinesis Receiver.






[jira] [Created] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-03-24 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-6514:
---

 Summary: For Kinesis Streaming, use the same region for DynamoDB 
(KCL checkpoints) as the Kinesis stream itself  
 Key: SPARK-6514
 URL: https://issues.apache.org/jira/browse/SPARK-6514
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Chris Fregly


This was not supported when I originally wrote this receiver; it is now.

Also, upgrade to the latest Kinesis Client Library (KCL), which I believe is 
1.2.






[jira] [Commented] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions

2015-03-04 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348241#comment-14348241
 ] 

Chris Fregly commented on SPARK-4184:
-

Add a reference to the Kinesis docs regarding the following Kinesis feature:

https://aws.amazon.com/blogs/aws/amazon-kinesis-update-reduced-prop-delay/

 Improve Spark Streaming documentation to address commonly-asked questions 
 --

 Key: SPARK-4184
 URL: https://issues.apache.org/jira/browse/SPARK-4184
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Reporter: Chris Fregly
  Labels: documentation, streaming

 Improve Streaming documentation including API descriptions, 
 concurrency/thread safety, fault tolerance, replication, checkpointing, 
 scalability, resource allocation and utilization, back pressure, and 
 monitoring.
 Also, add a section to the Kinesis streaming guide describing how to use IAM 
 roles with the Spark Kinesis Receiver.






[jira] [Updated] (SPARK-4184) Improve Spark Streaming documentation

2015-03-03 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-4184:

Target Version/s: 1.3.1  (was: 1.2.0)

 Improve Spark Streaming documentation
 -

 Key: SPARK-4184
 URL: https://issues.apache.org/jira/browse/SPARK-4184
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Reporter: Chris Fregly
  Labels: documentation, streaming

 Improve Streaming documentation including API descriptions, 
 concurrency/thread safety, fault tolerance, replication, checkpointing, 
 scalability, resource allocation and utilization, back pressure, and 
 monitoring.
 Also, add a section to the Kinesis streaming guide describing how to use IAM 
 roles with the Spark Kinesis Receiver.






[jira] [Updated] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions

2015-03-03 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-4184:

Summary: Improve Spark Streaming documentation to address commonly-asked 
questions   (was: Improve Spark Streaming documentation)

 Improve Spark Streaming documentation to address commonly-asked questions 
 --

 Key: SPARK-4184
 URL: https://issues.apache.org/jira/browse/SPARK-4184
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Reporter: Chris Fregly
  Labels: documentation, streaming

 Improve Streaming documentation including API descriptions, 
 concurrency/thread safety, fault tolerance, replication, checkpointing, 
 scalability, resource allocation and utilization, back pressure, and 
 monitoring.
 Also, add a section to the Kinesis streaming guide describing how to use IAM 
 roles with the Spark Kinesis Receiver.






[jira] [Commented] (SPARK-4184) Improve Spark Streaming documentation to address commonly-asked questions

2015-03-03 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346371#comment-14346371
 ] 

Chris Fregly commented on SPARK-4184:
-

Hey [~sowen]!

I'm gonna move this to 1.3.1.  This will complement my work on 
https://issues.apache.org/jira/browse/SPARK-5960, although the documentation 
improvements will extend beyond Kinesis.

Does that sound reasonable?  I'll close this out once and for all!  :)

Thanks!

-Chris

 Improve Spark Streaming documentation to address commonly-asked questions 
 --

 Key: SPARK-4184
 URL: https://issues.apache.org/jira/browse/SPARK-4184
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Reporter: Chris Fregly
  Labels: documentation, streaming

 Improve Streaming documentation including API descriptions, 
 concurrency/thread safety, fault tolerance, replication, checkpointing, 
 scalability, resource allocation and utilization, back pressure, and 
 monitoring.
 Also, add a section to the Kinesis streaming guide describing how to use IAM 
 roles with the Spark Kinesis Receiver.






[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()

2015-03-01 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342359#comment-14342359
 ] 

Chris Fregly commented on SPARK-5960:
-

Linking to an old JIRA where this was originally brought up.

Decided to add support for AWS credentials for non-IAM environments - of which 
we still see a fair amount.

This will open up Kinesis to more environments.
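
A sketch of the shape being added (argument names assumed): explicit keys for 
non-IAM environments, omitted to fall back to IAM roles:

{code}
stream = KinesisUtils.createStream(
    ssc, appName, streamName, endpointUrl, regionName,
    InitialPositionInStream.LATEST, 2, StorageLevel.MEMORY_AND_DISK_2,
    awsAccessKeyId=accessKey,   # omit both keys to use IAM roles instead
    awsSecretKey=secretKey)
{code}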



 Allow AWS credentials to be passed to KinesisUtils.createStream()
 -

 Key: SPARK-5960
 URL: https://issues.apache.org/jira/browse/SPARK-5960
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly
Assignee: Chris Fregly

 While IAM roles are preferable, we're seeing a lot of cases where we need to 
 pass AWS credentials when creating the KinesisReceiver.
 Notes:
 * Make sure we don't log the credentials anywhere
 * Maintain compatibility with existing KinesisReceiver-based code.
  






[jira] [Created] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB

2015-03-01 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-6101:
---

 Summary: Create a SparkSQL DataSource API implementation for 
DynamoDB
 Key: SPARK-6101
 URL: https://issues.apache.org/jira/browse/SPARK-6101
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Chris Fregly
 Fix For: 1.4.0


similar to https://github.com/databricks/spark-avro






[jira] [Created] (SPARK-6102) Create a SparkSQL DataSource API implementation for Redshift

2015-03-01 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-6102:
---

 Summary: Create a SparkSQL DataSource API implementation for 
Redshift
 Key: SPARK-6102
 URL: https://issues.apache.org/jira/browse/SPARK-6102
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Chris Fregly









[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2015-02-28 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341901#comment-14341901
 ] 

Chris Fregly commented on SPARK-4144:
-

Hey [~freeman-lab]!

I was literally just talking to [~josephkb] in the office last week about 
picking this up.  Great timing!

Let's coordinate offline.  I'll shoot you an email.

-Chris



 Support incremental model training of Naive Bayes classifier
 

 Key: SPARK-4144
 URL: https://issues.apache.org/jira/browse/SPARK-4144
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Chris Fregly
Assignee: Jeremy Freeman

 Per Xiangrui Meng from the following user list discussion:  
 http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E

 For Naive Bayes, we need to update the priors and conditional
 probabilities, which means we should also remember the number of
 observations for the updates.






[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()

2015-02-24 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335846#comment-14335846
 ] 

Chris Fregly commented on SPARK-5960:
-

pushing this up to 1.3.1

 Allow AWS credentials to be passed to KinesisUtils.createStream()
 -

 Key: SPARK-5960
 URL: https://issues.apache.org/jira/browse/SPARK-5960
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly
Assignee: Chris Fregly

 While IAM roles are preferable, we're seeing a lot of cases where we need to 
 pass AWS credentials when creating the KinesisReceiver.
 Notes:
 * Make sure we don't log the credentials anywhere
 * Maintain compatibility with existing KinesisReceiver-based code.
  






[jira] [Updated] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()

2015-02-24 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-5960:

Target Version/s: 1.3.1  (was: 1.4.0)

 Allow AWS credentials to be passed to KinesisUtils.createStream()
 -

 Key: SPARK-5960
 URL: https://issues.apache.org/jira/browse/SPARK-5960
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly
Assignee: Chris Fregly

 While IAM roles are preferable, we're seeing a lot of cases where we need to 
 pass AWS credentials when creating the KinesisReceiver.
 Notes:
 * Make sure we don't log the credentials anywhere
 * Maintain compatibility with existing KinesisReceiver-based code.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver

2015-02-23 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334256#comment-14334256
 ] 

Chris Fregly commented on SPARK-5959:
-

Checkpointing at a specific sequence number is supported by the 
IRecordProcessorCheckpointer interface.
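
For reference, a bare-bones KCL record processor that ACKs at an explicit
sequence number might look like this sketch; storeReliably() is a hypothetical
stand-in for the receiver's WAL-backed store() call:

{code:title=ReliableRecordProcessor.scala (sketch)|borderStyle=solid}
import java.util.{List => JList}

import com.amazonaws.services.kinesis.clientlibrary.interfaces.{IRecordProcessor, IRecordProcessorCheckpointer}
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownReason
import com.amazonaws.services.kinesis.model.Record

class ReliableRecordProcessor extends IRecordProcessor {
  override def initialize(shardId: String): Unit = {}

  override def processRecords(
      records: JList[Record],
      checkpointer: IRecordProcessorCheckpointer): Unit = {
    // storeReliably(records)  // hypothetical: returns only after the WAL write

    // Only after the block is durable, ACK back to Kinesis at the exact
    // sequence number known to be safe.
    val lastSeq = records.get(records.size - 1).getSequenceNumber
    checkpointer.checkpoint(lastSeq)
  }

  override def shutdown(
      checkpointer: IRecordProcessorCheckpointer,
      reason: ShutdownReason): Unit = {
    // Checkpoint on clean shard termination so the lease can be handed off.
    if (reason == ShutdownReason.TERMINATE) checkpointer.checkpoint()
  }
}
{code}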

 Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
 -

 Key: SPARK-5959
 URL: https://issues.apache.org/jira/browse/SPARK-5959
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly

 After each block is stored reliably in the WAL (after the store() call 
 returns),  ACK back to Kinesis.
 There is still the issue of the ReliableKinesisReceiver dying before the ACK 
 back to Kinesis; however, no data will be lost.  Duplicate data is still 
 possible.
 Notes:
 * Make sure we're not overloading the checkpoint control plane which uses 
 DynamoDB.
 * May need to disable auto-checkpointing and remove the checkpoint interval.
 * Maintain compatibility with existing KinesisReceiver-based code.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver

2015-02-23 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-5959:

  Component/s: Streaming
  Description: 
After each block is stored reliably in the WAL (after the store() call 
returns),  ACK back to Kinesis.

There is still the issue of the ReliableKinesisReceiver dying before the ACK 
back to Kinesis; however, no data will be lost.  Duplicate data is still 
possible.

Notes:
* Make sure we're not overloading the checkpoint control plane which uses 
DynamoDB.
* May need to disable auto-checkpointing and remove the checkpoint interval.
* Maintain compatibility with existing KinesisReceiver-based code.
 


  was:
After each block is stored reliably in the WAL (after the store() call 
returns),  ACK back to Kinesis.

There is still the issue of the ReliableKinesisReceiver dying before the ACK 
back to Kinesis; however, no data will be lost.  Duplicate data is still 
possible.

Notes
* Make sure we're not overloading the checkpoint control plane which uses 
DynamoDB.
* May need to disable auto-checkpointing and remove the checkpoint interval.

 


 Target Version/s: 1.4.0
Affects Version/s: 1.1.0
Fix Version/s: (was: 1.4.0)

 Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver
 -

 Key: SPARK-5959
 URL: https://issues.apache.org/jira/browse/SPARK-5959
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly

 After each block is stored reliably in the WAL (after the store() call 
 returns),  ACK back to Kinesis.
 There is still the issue of the ReliableKinesisReceiver dying before the ACK 
 back to Kinesis; however, no data will be lost.  Duplicate data is still 
 possible.
 Notes:
 * Make sure we're not overloading the checkpoint control plane which uses 
 DynamoDB.
 * May need to disable auto-checkpointing and remove the checkpoint interval.
 * Maintain compatibility with existing KinesisReceiver-based code.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5959) Create a ReliableKinesisReceiver similar to the ReliableKafkaReceiver

2015-02-23 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-5959:
---

 Summary: Create a ReliableKinesisReceiver similar to the 
ReliableKafkaReceiver
 Key: SPARK-5959
 URL: https://issues.apache.org/jira/browse/SPARK-5959
 Project: Spark
  Issue Type: Improvement
Reporter: Chris Fregly
 Fix For: 1.4.0


After each block is stored reliably in the WAL (after the store() call 
returns),  ACK back to Kinesis.

There is still the issue of the ReliableKinesisReceiver dying before the ACK 
back to Kinesis; however, no data will be lost.  Duplicate data is still 
possible.

Notes
* Make sure we're not overloading the checkpoint control plane which uses 
DynamoDB.
* May need to disable auto-checkpointing and remove the checkpoint interval.

 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5961) Allow specific nodes in a Spark Streaming cluster to be dedicated/preferred as Receiver Worker Nodes versus regular Spark Worker Nodes

2015-02-23 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-5961:
---

 Summary: Allow specific nodes in a Spark Streaming cluster to be 
dedicated/preferred as Receiver Worker Nodes versus regular Spark Worker Nodes
 Key: SPARK-5961
 URL: https://issues.apache.org/jira/browse/SPARK-5961
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Chris Fregly


This type of configuration has come up a lot: certain nodes should be 
dedicated as Spark Streaming Receiver nodes, while others serve as regular Spark Worker nodes. 

The reasons include the following:
1) Different instance types/sizes for Receivers vs regular Workers
2) Different OS tuning params for Receivers vs regular Workers
...

  
 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2015-02-04 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305351#comment-14305351
 ] 

Chris Fregly commented on SPARK-4144:
-

Hi there!  Any update on this?  I was thinking of working on this as it's been 
idle for the last few months.

Lemme know.

Thanks!

-Chris

 Support incremental model training of Naive Bayes classifier
 

 Key: SPARK-4144
 URL: https://issues.apache.org/jira/browse/SPARK-4144
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Chris Fregly
Assignee: Liquan Pei

 Per Xiangrui Meng from the following user list discussion:  
 http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E

 For Naive Bayes, we need to update the priors and conditional
 probabilities, which means we should also remember the number of
 observations for the updates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4184) Improve Spark Streaming documentation

2014-12-31 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262377#comment-14262377
 ] 

Chris Fregly commented on SPARK-4184:
-

hey josh-

lemme go through my notes and figure out if everything got into TD's latest 
iteration of the docs.  i'll get back to you in the next few days.  good catch.

 Improve Spark Streaming documentation
 -

 Key: SPARK-4184
 URL: https://issues.apache.org/jira/browse/SPARK-4184
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Reporter: Chris Fregly
  Labels: documentation, streaming

 Improve Streaming documentation including API descriptions, 
 concurrency/thread safety, fault tolerance, replication, checkpointing, 
 scalability, resource allocation and utilization, back pressure, and 
 monitoring.
 also, add a section to the kinesis streaming guide describing how to use IAM 
 roles with the Spark Kinesis Receiver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4689) Unioning 2 SchemaRDDs should return a SchemaRDD in Python, Scala, and Java

2014-12-01 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-4689:
---

 Summary: Unioning 2 SchemaRDDs should return a SchemaRDD in 
Python, Scala, and Java
 Key: SPARK-4689
 URL: https://issues.apache.org/jira/browse/SPARK-4689
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Chris Fregly
Priority: Minor


Currently, you need to use unionAll() in Scala.  

Python does not expose this functionality at the moment.

The current workaround is to use the UNION ALL HiveQL functionality detailed 
here:  https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
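
For reference, both paths side by side (table names are placeholders):

{code:title=UnionAllWorkaround.scala (sketch)|borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object UnionAllWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("UnionAllWorkaround"))
    val hiveContext = new HiveContext(sc)

    // Scala: unionAll() keeps the result a SchemaRDD.
    val a = hiveContext.sql("SELECT * FROM t1")
    val b = hiveContext.sql("SELECT * FROM t2")
    val unioned = a.unionAll(b)

    // HiveQL workaround, also reachable from Python via sqlContext.sql(...).
    val unioned2 = hiveContext.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2")

    println(unioned.count() + " == " + unioned2.count())
    sc.stop()
  }
}
{code}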





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider

2014-11-09 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14204334#comment-14204334
 ] 

Chris Fregly commented on SPARK-3640:
-

quick update:

Aniket and I spoke off-line about using AWS IAM Instance Profiles for EC2 
instances.  These work similarly to IAM User Profiles - you can apply 
fine-grained IAM Policies to EC2 instances.

The DefaultCredentialsProvider handles all of these sources of AWS credentials.

I am adding all of this to the Kinesis Spark Streaming Guide, btw.

Summary:  we may be able to close this jira without a change.  just waiting for 
Aniket to confirm that this AWS Instance Profile approach satisfies his need.  
it seems to be a safer approach than passing credentials between Spark Driver 
and Worker nodes.


 KinesisUtils should accept a credentials object instead of forcing 
 DefaultCredentialsProvider
 -

 Key: SPARK-3640
 URL: https://issues.apache.org/jira/browse/SPARK-3640
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Aniket Bhatnagar
  Labels: kinesis

 KinesisUtils should accept AWS Credentials as a parameter and should default 
 to DefaultCredentialsProvider if no credentials are provided. Currently, the 
 implementation forces usage of DefaultCredentialsProvider which can be a pain 
 especially when jobs are run by multiple  unix users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3639) Kinesis examples set master as local

2014-11-01 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193285#comment-14193285
 ] 

Chris Fregly commented on SPARK-3639:
-

great catch, aniket!

for some background context, i was trying to make the sample easier to run out 
of the box.  i overlooked the spark-submit scenario, unfortunately.  thanks for 
fixing this.

few things:
1) does the Streaming Kinesis Guide (docs/streaming-kinesis-integration.md) 
need updating with your change?  specifically, the "Running the Example" section? 
 i don't think so, but something to double-check.

2) i noticed you put a comment in the scaladoc about needing +1 workers/threads 
than receivers, perhaps we should reword this to say 

  "(number of kinesis shards+1) workers/threads are needed"

because the number of receivers is determined by the number of shards in the 
kinesis stream.  might tighten up the message a bit.  

3)  should we throw an error if the number of workers/threads is not 
sufficient?  nobody likes an error message, but might be helpful here.  this is 
the basis of https://issues.apache.org/jira/browse/SPARK-2475, btw.  might want 
to keep an eye on that jira.

thanks again, man.  great catch.

-chris

 Kinesis examples set master as local
 

 Key: SPARK-3639
 URL: https://issues.apache.org/jira/browse/SPARK-3639
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 1.0.2, 1.1.0
Reporter: Aniket Bhatnagar
Assignee: Aniket Bhatnagar
Priority: Minor
  Labels: examples
 Fix For: 1.1.1, 1.2.0


 Kinesis examples set master as local thus not allowing the example to be 
 tested on a cluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4184) Improve Spark Streaming documentation

2014-11-01 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-4184:
---

 Summary: Improve Spark Streaming documentation
 Key: SPARK-4184
 URL: https://issues.apache.org/jira/browse/SPARK-4184
 Project: Spark
  Issue Type: Documentation
  Components: Streaming
Reporter: Chris Fregly
 Fix For: 1.2.0


Improve Streaming documentation including API descriptions, concurrency/thread 
safety, fault tolerance, replication, checkpointing, scalability, resource 
allocation and utilization, back pressure, and monitoring.

also, add a section to the kinesis streaming guide describing how to use IAM 
roles with the Spark Kinesis Receiver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3640) KinesisUtils should accept a credentials object instead of forcing DefaultCredentialsProvider

2014-10-31 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192411#comment-14192411
 ] 

Chris Fregly commented on SPARK-3640:
-

Agreed that this was not ideal when I first chose this implementation.  And as 
you mentioned, the NotSerializableException is exactly why I went with the 
DefaultCredentialsProvider.

So I spent some time trying to solve this using AWS IAM Roles on separate users 
under your root AWS account.  This appears to work well with the existing 
DefaultCredentialsProvider.

Is this a viable option for you?  

Basically, every user would get their own ACCESS_KEY_ID and SECRET_KEY.  This 
would be used in place of the root credentials.

For thoroughness, I've included links to the instructions as well as an example 
IAM Policy JSON (I'll also add this to the Spark Kinesis Developer Guide: 
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html):

Creating IAM users
http://docs.aws.amazon.com/IAM/latest/UserGuide/Using_SettingUpUser.html
https://console.aws.amazon.com/iam/home?#security_credential 

Setting up Kinesis, DynamoDB, and CloudWatch IAM Policy for the new users
http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-using-iam.html

IAM Policy Generator
http://awspolicygen.s3.amazonaws.com/policygen.html

Attaching the Custom Policy 
https://console.aws.amazon.com/iam/home?#users
Select the user
Select Attach Policy
Select Custom Policy

IAM Policy JSON 
This is already generated using the Policy Generator above... just fill 
in the missing pieces specific to your environment.
{
  "Statement": [
    {
      "Sid": "Stmt1414784467497",
      "Action": "kinesis:*",
      "Effect": "Allow",
      "Resource": "arn:aws:kinesis:region-of-stream:aws-account-id:stream/stream-name"
    },
    {
      "Sid": "Stmt1414784693732",
      "Action": "dynamodb:*",
      "Effect": "Allow",
      "Resource": "arn:aws:dynamodb:us-east-1:aws-account-id:table/dynamodb-tablename"
    },
    {
      "Sid": "Stmt1414785131046",
      "Action": "cloudwatch:*",
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

Notes:
* The region of the DynamoDB table is intentionally hard-coded to us-east-1 as 
this is how Kinesis currently works
* The DynamoDB table name is the same as the application name of the Kinesis 
Streaming Application.  The sample included with the Spark distribution uses 
KinesisWordCount for the application/table name.


Is this a sufficient workaround?  Using IAM Policies is an AWS best practice, 
but I'm not sure whether this aligns with your existing environment.  If not, I can 
continue to investigate exposing that CredentialsProvider.

Lemme know, Aniket!


 KinesisUtils should accept a credentials object instead of forcing 
 DefaultCredentialsProvider
 -

 Key: SPARK-3640
 URL: https://issues.apache.org/jira/browse/SPARK-3640
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Aniket Bhatnagar
  Labels: kinesis

 KinesisUtils should accept AWS Credentials as a parameter and should default 
 to DefaultCredentialsProvider if no credentials are provided. Currently, the 
 implementation forces usage of DefaultCredentialsProvider which can be a pain 
 especially when jobs are run by multiple  unix users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2014-10-29 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-4144:
---

 Summary: Support incremental model training of Naive Bayes 
classifier
 Key: SPARK-4144
 URL: https://issues.apache.org/jira/browse/SPARK-4144
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Chris Fregly


Per Xiangrui Meng from the following user list discussion:  
http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
   

For Naive Bayes, we need to update the priors and conditional
probabilities, which means we should also remember the number of
observations for the updates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2579) Reading from S3 returns an inconsistent number of items with Spark 0.9.1

2014-08-30 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116643#comment-14116643
 ] 

Chris Fregly commented on SPARK-2579:
-

interesting and possibly-related blog post from netflix earlier this year:  
http://techblog.netflix.com/2014/01/s3mper-consistency-in-cloud.html

 Reading from S3 returns an inconsistent number of items with Spark 0.9.1
 

 Key: SPARK-2579
 URL: https://issues.apache.org/jira/browse/SPARK-2579
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 0.9.1
Reporter: Eemil Lagerspetz
Priority: Critical
  Labels: hdfs, read, s3, skipping

 I have created a random matrix of 1M rows with 10K items on each row, 
 semicolon-separated. While reading it with Spark 0.9.1 and doing a count, I 
 consistently get less than 1M rows, and a different number every time at that 
 ( !! ). Example below:
 head -n 1 tool-generate-random-matrix*log
 == tool-generate-random-matrix-999158.log ==
 Row item counts: 999158
 == tool-generate-random-matrix.log ==
 Row item counts: 997163
 The data is split into 1000 partitions. When I download it using s3cmd sync, 
 and run the following AWK on it, I get the correct number of rows in each 
 partition (1000x1000 = 1M). What is up?
 {code:title=checkrows.sh|borderStyle=solid}
 for k in part-0*
 do
   echo $k
   awk -F ";" '
 NF != 1 {
   print "Wrong number of items:",NF
 }
 END {
   if (NR != 1000) {
     print "Wrong number of rows:",NR
   }
 }' $k
 done
 {code}
 The matrix generation and counting code is below:
 {code:title=Matrix.scala|borderStyle=solid}
 package fi.helsinki.cs.nodes.matrix
 import java.util.Random
 import org.apache.spark._
 import org.apache.spark.SparkContext._
 import scala.collection.mutable.ListBuffer
 import org.apache.spark.rdd.RDD
 import org.apache.spark.storage.StorageLevel._
 object GenerateRandomMatrix {
   def NewGeMatrix(rSeed: Int, rdd: RDD[Int], features: Int) = {
     rdd.mapPartitions(part => part.map(xarr => {
       val rdm = new Random(rSeed + xarr)
       val arr = new Array[Double](features)
       for (i <- 0 until features)
         arr(i) = rdm.nextDouble()
       new Row(xarr, arr)
     }))
   }
   case class Row(id: Int, elements: Array[Double]) {}
   def rowFromText(line: String) = {
     val idarr = line.split(" ")
     val arr = idarr(1).split(";")
     // -1 to fix saved matrix indexing error
     new Row(idarr(0).toInt - 1, arr.map(_.toDouble))
   }
   def main(args: Array[String]) {
     val master = args(0)
     val tasks = args(1).toInt
     val savePath = args(2)
     val read = args.contains("read")

     val datapoints = 100
     val features = 1
     val sc = new SparkContext(master, "RandomMatrix")
     if (read) {
       val randomMatrix: RDD[Row] =
         sc.textFile(savePath, tasks).map(rowFromText).persist(MEMORY_AND_DISK)
       println("Row item counts: " + randomMatrix.count)
     } else {
       val rdd = sc.parallelize(0 until datapoints, tasks)
       val bcSeed = sc.broadcast(128)
       /* Generating a matrix of random Doubles */
       val randomMatrix =
         NewGeMatrix(bcSeed.value, rdd, features).persist(MEMORY_AND_DISK)
       randomMatrix.map(row => row.id + " " + row.elements.mkString(";")).saveAsTextFile(savePath)
     }

     sc.stop
   }
 }
 {code}
 I run this with:
 appassembler/bin/tool-generate-random-matrix <master> 1000 
 s3n://keys@path/to/data 1>matrix.log 2>matrix.err
 Reading from HDFS gives the right count and right number of items on each 
 row. However, I had to run with the full path with the server name, just 
 /matrix does not work (it thinks I want file://):
 p=hdfs://ec2-54-188-6-77.us-west-2.compute.amazonaws.com:9000/matrix
 appassembler/bin/tool-generate-random-matrix $( cat 
 /root/spark-ec2/cluster-url ) 1000 $p read 1>readmatrix.log 2>readmatrix.err



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2475) Check whether #cores > #receivers in local mode

2014-08-28 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114204#comment-14114204
 ] 

Chris Fregly commented on SPARK-2475:
-

another option for the examples, specifically, is to default the number of 
local threads similarly to how the Kinesis example does it:  

https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L104

i get the number of shards in the given Kinesis stream and add 1.  the goal was 
to make this example work out of the box with little friction - even an error 
message can be discouraging.

for the other examples, we could just default to 2.  the advanced user can 
override if they want.  though i don't think i support an override in my 
kinesis example.  whoops!  :)
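
For reference, the shard-count logic referenced above boils down to something
like this (the stream name is a placeholder):

{code:title=DefaultLocalThreads.scala (sketch)|borderStyle=solid}
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.AmazonKinesisClient

object DefaultLocalThreads {
  def main(args: Array[String]): Unit = {
    // One receiver per shard, plus at least one thread left to process batches.
    val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())
    val numShards =
      kinesisClient.describeStream("myStream").getStreamDescription.getShards.size
    val master = s"local[${numShards + 1}]"
    println(s"defaulting master to $master")
  }
}
{code}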

 Check whether #cores > #receivers in local mode
 ---

 Key: SPARK-2475
 URL: https://issues.apache.org/jira/browse/SPARK-2475
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das

 When the number of slots in local mode is not more than the number of 
 receivers, then the system should throw an error. Otherwise the system just 
 keeps waiting for resources to process the received data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-08-02 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14083836#comment-14083836
 ] 

Chris Fregly commented on SPARK-1981:
-

hey nick-

due to the Kinesis Client Library's ASL license restriction, we ended up 
isolating all kinesis-related code to the extras/kinesis-asl module.

this module can be activated at build time by including -Pkinesis-asl in either 
sbt or maven.

this is all documented here, btw:  
https://github.com/apache/spark/blob/master/docs/streaming-kinesis.md

looks like i messed up the markdown a bit.  whoops!  but the details are all 
there.  i'll try to clean that up.

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly
 Fix For: 1.1.0


 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2770) Rename spark-ganglia-lgpl to ganglia-lgpl

2014-07-31 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-2770:
---

 Summary: Rename spark-ganglia-lgpl to ganglia-lgpl
 Key: SPARK-2770
 URL: https://issues.apache.org/jira/browse/SPARK-2770
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Chris Fregly
Priority: Minor
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-29 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078422#comment-14078422
 ] 

Chris Fregly commented on SPARK-1981:
-

[~matei] the ec2 scripts allow you to specify a github repo and commit hash, so 
i assumed they can build from source.  if this is the case, i need the ability 
to pass the list of -P build profiles such as -Pspark-kinesis-asl which i don't 
think exists currently.

how about the audit and release process?  have i covered everything there?

thanks!

-chris





 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly

 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-23 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072761#comment-14072761
 ] 

Chris Fregly commented on SPARK-1981:
-

in addition to the ec2 scripts, can someone verify that all other 
build-and-release-related use cases have been covered?  i mimic'd the ganglia 
extras project, but this project doesn't seem to be covered by either the ec2 
scripts or the audit-release process.  perhaps we should add it, as well?

any advice from someone closer to the build and release process would be 
appreciated.

thanks!

-chris

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly

 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-19 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067696#comment-14067696
 ] 

Chris Fregly commented on SPARK-1981:
-

[~pwendell]  is there anything i need to do within the spark_ec2 scripts to 
makes sure kinesis is built and/or enabled when EC2 instances are created?  i 
want to make sure i'm covering all the bases.

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly

 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-16 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063247#comment-14063247
 ] 

Chris Fregly commented on SPARK-1981:
-

PR:  https://github.com/apache/spark/pull/1434

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly

 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-14 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061007#comment-14061007
 ] 

Chris Fregly commented on SPARK-1981:
-

quick update:

i completed all code, examples, tests, build, and documentation changes this 
weekend.  everything looks good.

however, when i went to merge last night, i noticed this PR:  
https://github.com/apache/spark/pull/772 

this changes the underlying maven and sbt builds a bit - for the better, of 
course!

reverting my build changes and adapting to the new build structure are the last 
step which i plan to tackle today.

almost there!


 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly

 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-10 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057847#comment-14057847
 ] 

Chris Fregly commented on SPARK-1981:
-

hey guys-

i'm in the final phases of cleanup.  i refactored quite a bit of the original 
code to make things more testable - and easier to understand.  

oh, and i did, indeed, choose the optional-module route.  we'll address the 
additional complexity through documentation.  that's what i'm working on right 
now, actually.

hoping to submit the PR by tomorrow or this weekend at the very latest.  

the goal is to get this in to the 1.1 release which has a timeline outlined 
here:  https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage 

thanks!

-chris

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly

 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-05 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14053037#comment-14053037
 ] 

Chris Fregly commented on SPARK-1981:
-

[~matei] [~pwendell] i'm in the process of making the Kinesis Streaming 
component an optional module similar to ganglia per 
https://issues.apache.org/jira/browse/LEGAL-198

unlike the ganglia component, however, this component has tests and examples 
similar to the other streaming implementations such as Kafka and Flume.  these 
other implementations have their tests and examples in external/ and examples/, 
respectively.
 
if i understand correctly, i need to put the kinesis streaming code, tests, and 
examples all within extras/, correct?

this will cause a bit of confusion for people searching the examples/ source, 
but - unless i'm missing something -  this is the best we can do given the 
current build scripts and directory structure.

is this the correct approach?

the other option is to stick with the base AWS Java SDK which is under Apache 
2.0 license (https://github.com/aws/aws-sdk-java/blob/master/LICENSE.txt)

we'd lose some of the convenience goodies that the Kinesis Client Library gives 
us like worker load balancing, shard autoscaling, checkpointing, etc. but would 
simplify the build.

definitely not optimal, but throwing it out as an option.

thoughts?

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly

 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-01 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049484#comment-14049484
 ] 

Chris Fregly commented on SPARK-1981:
-

hey jonathan!

i was just talking to the databricks guys about this at the spark summit 
yesterday.  it's my top priority after the summit ends (tomorrow).   my goal is 
to get a PR submitted by this weekend. 

the code is written.  i just need to do some cleanup.

[~pwendell] can you assign this jira to me?  i don't have permission, it 
appears.

thanks!

-chris

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly

 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1981) Add AWS Kinesis streaming support

2014-05-31 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-1981:
---

 Summary: Add AWS Kinesis streaming support
 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly


Add AWS Kinesis support to Spark Streaming.

Initial discussion occurred here:  https://github.com/apache/spark/pull/223

I discussed this with Parviz from AWS recently and we agreed that I would take 
this over.

Look for a new PR that takes into account all the feedback from the earlier PR 
including spark-1.0-compliant implementation, AWS-license-aware build support, 
tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)