[jira] [Commented] (SPARK-9492) LogisticRegression in R should provide model statistics

2015-10-27 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977134#comment-14977134
 ] 

Bilind Hajer commented on SPARK-9492:
-

So is there no other way to get the coefficients from a PipelineModel of family 
binomial? 
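
For concreteness, this is the kind of call I mean; a minimal sketch, assuming 
SparkR's glm() interface and a DataFrame df with a binary label column (the 
column names here are hypothetical):

# Sketch only: glm() in SparkR returns a PipelineModel. This sub-task is about
# making summary() report feature names and coefficients for family = "binomial"
# the way it already does for family = "gaussian".
model <- glm(label ~ feature_1 + feature_2, data = df, family = "binomial")
summary(model)   # expected to expose the coefficients once this is implemented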

> LogisticRegression in R should provide model statistics
> ---
>
> Key: SPARK-9492
> URL: https://issues.apache.org/jira/browse/SPARK-9492
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, R
>Reporter: Eric Liang
>
> Like ml LinearRegression, LogisticRegression should provide a training 
> summary including feature names and their coefficients.





[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968596#comment-14968596
 ] 

Bilind Hajer commented on SPARK-11190:
--

> df.recipes <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", keyspace = "datahub", table = "person_recipes")
> someRecords <- head(df.recipes)
> mapField <- someRecords$recipes[[1]]

> ls( mapField ) 
 [1] "1000"   "1100"   "12000"  "18000"  "2000"   "22000"  "22074"  "24000" 
 [9] "28000"  "3000"   "33000"  "44000"  "45000"  "47000"  "48000"  "49000" 
[17] "5000"   "51000"  "53000"  "55000"  "56000"  "57000"  "57076"  "6" 
[25] "63000"  "64000"  "65000"  "66000"  "67000"  "73000"  "75000"  "79000" 
[33] "8"  "82000"  "83000"  "84000"  "87000"  "89000"  "9"  "999000"

> mapField[["1000"]]
[1] 0

Success! Thanks, guys. R environment data types are pretty cool; I didn't know 
they existed. 
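
For reference, the environment can be flattened back into ordinary R structures; 
a minimal sketch, assuming mapField is the environment shown above:

# Sketch: convert the environment returned for a Cassandra map column into a
# plain named list / named vector for easier downstream use in R.
recipe.list <- as.list(mapField)       # map keys become element names
recipe.vec  <- unlist(recipe.list)     # named numeric vector
recipe.vec[["1000"]]                   # same lookup as mapField[["1000"]], i.e. 0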

> SparkR support for cassandra collection types. 
> ---
>
> Key: SPARK-11190
> URL: https://issues.apache.org/jira/browse/SPARK-11190
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
>Reporter: Bilind Hajer
>  Labels: cassandra, dataframe, sparkR

[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967979#comment-14967979
 ] 

Bilind Hajer commented on SPARK-11190:
--

OK, I got the master version of Spark built, and I am no longer getting 
"Error in as.data.frame.default(x[[i]], optional = TRUE)" when reading a data 
frame from a Cassandra column family that contains collection data types. 

But, for example, for a map column it is reading the field back as something like 
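
A quick way to see what actually comes back on the R side is to check the class 
of the field; a small sketch, assuming the df.recipes DataFrame from my other 
comment:

# Sketch: inspect the R type SparkR hands back for a Cassandra map column.
someRecords <- head(df.recipes)
class(someRecords$recipes[[1]])   # "environment" in my case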


> SparkR support for cassandra collection types. 
> ---
>
> Key: SPARK-11190
> URL: https://issues.apache.org/jira/browse/SPARK-11190
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
>Reporter: Bilind Hajer
>  Labels: cassandra, dataframe, sparkR
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967903#comment-14967903
 ] 

Bilind Hajer commented on SPARK-11190:
--

I built Spark with mvn from the current master on GitHub. The Scala side builds 
fine, but when I open a sparkR shell it seems I only have access to plain R, and 
not SparkR. Any ideas? 
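
A sketch of what I will try next, assuming the SparkR R package has to be built 
separately from the source checkout (script paths as in the repo):

# Sketch: after the Maven build, build/install the SparkR package into R/lib,
# then launch the shell from the same checkout.
./R/install-dev.sh
./bin/sparkR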

> SparkR support for cassandra collection types. 
> ---
>
> Key: SPARK-11190
> URL: https://issues.apache.org/jira/browse/SPARK-11190
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
>Reporter: Bilind Hajer
>  Labels: cassandra, dataframe, sparkR

[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-21 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968079#comment-14968079
 ] 

Bilind Hajer commented on SPARK-11190:
--

Well, this would be a map datatype coming from Cassandra, and it should 
convert to the corresponding R datatype when read in sparkR. I do not think R 
has a native map datatype, though? 
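
For comparison, the closest base R analogues to a map that I know of are named 
lists and environments; a tiny sketch:

# Sketch: base R stand-ins for a key -> value map.
m.list <- list("1000" = 0, "2000" = 5)           # named list, keyed by name
m.env  <- list2env(m.list, envir = new.env())    # environment, hash-style lookup
m.list[["1000"]]                                 # 0
m.env[["2000"]]                                  # 5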

> SparkR support for cassandra collection types. 
> ---
>
> Key: SPARK-11190
> URL: https://issues.apache.org/jira/browse/SPARK-11190
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
>Reporter: Bilind Hajer
>  Labels: cassandra, dataframe, sparkR

[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-20 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965391#comment-14965391
 ] 

Bilind Hajer commented on SPARK-11190:
--

Nvm, answered my own question. Will test on Master and let you guys know. 
Thanks. 

> SparkR support for cassandra collection types. 
> ---
>
> Key: SPARK-11190
> URL: https://issues.apache.org/jira/browse/SPARK-11190
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
>Reporter: Bilind Hajer
>  Labels: cassandra, dataframe, sparkR

[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-20 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965320#comment-14965320
 ] 

Bilind Hajer commented on SPARK-11190:
--

Sorry about that, Sean Owen. So I'm assuming this fix is in the master branch, 
for Spark 1.5.2? 

> SparkR support for cassandra collection types. 
> ---
>
> Key: SPARK-11190
> URL: https://issues.apache.org/jira/browse/SPARK-11190
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
>Reporter: Bilind Hajer
>  Labels: cassandra, dataframe, sparkR

[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-20 Thread Bilind Hajer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965363#comment-14965363
 ] 

Bilind Hajer commented on SPARK-11190:
--

How can I get access to the master branch? Would I be able to clone the current 
repo and test it locally?
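
A sketch of what I have in mind, assuming the standard Apache Spark GitHub repo 
and the bundled Maven wrapper:

# Sketch: clone current master, build it, then build the SparkR package locally.
git clone https://github.com/apache/spark.git
cd spark
build/mvn -DskipTests clean package
R/install-dev.sh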

> SparkR support for cassandra collection types. 
> ---
>
> Key: SPARK-11190
> URL: https://issues.apache.org/jira/browse/SPARK-11190
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
>Reporter: Bilind Hajer
>  Labels: cassandra, dataframe, sparkR

[jira] [Created] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-19 Thread Bilind Hajer (JIRA)
Bilind Hajer created SPARK-11190:


 Summary: SparkR support for cassandra collection types. 
 Key: SPARK-11190
 URL: https://issues.apache.org/jira/browse/SPARK-11190
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.1
 Environment: SparkR Version: 1.5.1
Cassandra Version: 2.1.6
R Version: 3.2.2 
Cassandra Connector version: 1.5.0-M2
Reporter: Bilind Hajer
 Fix For: 1.5.2


I want to create a data frame from a Cassandra keyspace and column family in 
sparkR. 
I am able to create data frames from tables which do not include any Cassandra 
collection datatypes, such as Map, Set and List. But many of the schemas that I 
need data from do include these collection data types. 

Here is my local environment. 

SparkR Version: 1.5.1
Cassandra Version: 2.1.6
R Version: 3.2.2 
Cassandra Connector version: 1.5.0-M2

To test this issue, I did the following iterative process. 

sudo ./sparkR --packages 
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf 
spark.cassandra.connection.host=127.0.0.1

Running sparkR with this command gives me access to the Spark Cassandra 
connector package I need, and connects me to my local Cassandra server (which is 
up and running while I run this code in the sparkR shell). 

CREATE TABLE test_table (
  column_1 int,
  column_2 text,
  column_3 float,
  column_4 uuid,
  column_5 timestamp,
  column_6 boolean,
  column_7 timeuuid,
  column_8 bigint,
  column_9 blob,
  column_10   ascii,
  column_11   decimal,
  column_12   double,
  column_13   inet,
  column_14   varchar,
  column_15   varint,
  PRIMARY KEY( ( column_1, column_2 ) )
); 

All of the above data types are supported. I insert dummy data after creating 
this test schema. 
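
For example, the dummy row that shows up in the head() output further down could 
be inserted with something like the following (only the non-null columns are 
listed; this is just an illustration):

INSERT INTO datahub.test_table (column_1, column_2, column_3)
VALUES (1, 'hello', 3.4);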

For example, now in my sparkR shell, I run the following code. 

df.test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", 
keyspace = "datahub", table = "test_table")

This assigns with no errors. Then: 

> schema(df.test)

StructType
|-name = "column_1", type = "IntegerType", nullable = TRUE
|-name = "column_2", type = "StringType", nullable = TRUE
|-name = "column_10", type = "StringType", nullable = TRUE
|-name = "column_11", type = "DecimalType(38,18)", nullable = TRUE
|-name = "column_12", type = "DoubleType", nullable = TRUE
|-name = "column_13", type = "InetAddressType", nullable = TRUE
|-name = "column_14", type = "StringType", nullable = TRUE
|-name = "column_15", type = "DecimalType(38,0)", nullable = TRUE
|-name = "column_3", type = "FloatType", nullable = TRUE
|-name = "column_4", type = "UUIDType", nullable = TRUE
|-name = "column_5", type = "TimestampType", nullable = TRUE
|-name = "column_6", type = "BooleanType", nullable = TRUE
|-name = "column_7", type = "UUIDType", nullable = TRUE
|-name = "column_8", type = "LongType", nullable = TRUE
|-name = "column_9", type = "BinaryType", nullable = TRUE

Schema is correct. 

> class(df.test)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"

df.test is clearly defined to be a DataFrame Object. 

> head(df.test)

  column_1 column_2 column_10 column_11 column_12 column_13 column_14 column_15
1        1    hello        NA        NA        NA        NA        NA        NA
  column_3 column_4 column_5 column_6 column_7 column_8 column_9
1      3.4       NA       NA       NA       NA       NA       NA

sparkR is reading from the column_family correctly, but now let's add a 
collection data type to the schema. 

Now I will drop test_table and recreate it with an extra column of a map 
data type (column_16 below). 

CREATE TABLE test_table (
  column_1 int,
  column_2 text,
  column_3 float,
  column_4 uuid,
  column_5 timestamp,
  column_6 boolean,
  column_7 timeuuid,
  column_8 bigint,
  column_9 blob,
  column_10   ascii,
  column_11   decimal,
  column_12   double,
  column_13   inet,
  column_14   varchar,
  column_15   varint,
  column_16   map,
  PRIMARY KEY( ( column_1, column_2 ) )
); 

After inserting dummy data into the new test schema, 

> df.test  <- read.df(sqlContext,  source =