[jira] [Assigned] (SPARK-19145) Timestamp to String casting is slowing the query significantly

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19145:


Assignee: (was: Apache Spark)

> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>
> I have a time series table with a timestamp column.
> The following query
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to
> String before the filter is applied.
> However, in the second query no such casting is performed and the filter uses a
> long value.
> Below is the generated physical plan for the slower execution, followed by the
> physical plan for the faster execution:
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3339L])
>  +- *Project
> +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) 
> >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 
> 19:53:51))
>+- *FileScan parquet default.cstat[time#3314] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: 
> struct
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3287L])
>  +- *Project
> +- *Filter ((isnotnull(time#3262) && (time#3262 >= 
> 148340483100)) && (time#3262 <= 148400963100))
>+- *FileScan parquet default.cstat[time#3262] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time), 
> GreaterThanOrEqual(time,2017-01-02 19:53:51.0), 
> LessThanOrEqual(time,2017-01-09..., ReadSchema: struct
> In Impala both queries run efficiently without any performance difference.
> Spark should be able to parse the date string and convert it to Long/Timestamp
> while generating the optimized logical plan, so that both queries have similar
> performance.
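
A minimal sketch of the workaround the faster query relies on, assuming a table `default`.`table` with a TIMESTAMP column `time` (only the table and column names come from the report; everything else is illustrative). Casting the literal rather than the column keeps the predicate on the timestamp type, so the Parquet filters can be pushed down.

{code}
// Sketch only: compare the slow (string-cast) and fast (timestamp) filter forms.
import org.apache.spark.sql.SparkSession

object TimestampFilterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("timestamp-filter-sketch").getOrCreate()

    // Slow form from the report: string literals make Spark cast `time` to string per row.
    val slow = spark.sql(
      """SELECT COUNT(*) AS `count` FROM `default`.`table`
        |WHERE `time` >= '2017-01-02 19:53:51' AND `time` <= '2017-01-09 19:53:51'""".stripMargin)

    // Workaround: compare against timestamp values so the predicate stays on the column.
    val fast = spark.sql(
      """SELECT COUNT(*) AS `count` FROM `default`.`table`
        |WHERE `time` >= CAST('2017-01-02 19:53:51' AS TIMESTAMP)
        |  AND `time` <= CAST('2017-01-09 19:53:51' AS TIMESTAMP)""".stripMargin)

    slow.explain() // Filter contains cast(time as string); no timestamp filter is pushed down
    fast.explain() // Filter compares timestamps directly and is pushed to the file scan
    spark.stop()
  }
}
{code}

The to_utc_timestamp form in the report has the same effect; the point is only that the comparison must stay on the timestamp column.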



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19145) Timestamp to String casting is slowing the query significantly

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19145:


Assignee: Apache Spark

> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>Assignee: Apache Spark
>
> I have a time series table with a timestamp column.
> The following query
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to
> String before the filter is applied.
> However, in the second query no such casting is performed and the filter uses a
> long value.
> Below is the generated physical plan for the slower execution, followed by the
> physical plan for the faster execution:
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3339L])
>  +- *Project
> +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) 
> >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 
> 19:53:51))
>+- *FileScan parquet default.cstat[time#3314] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: 
> struct
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3287L])
>  +- *Project
> +- *Filter ((isnotnull(time#3262) && (time#3262 >= 
> 148340483100)) && (time#3262 <= 148400963100))
>+- *FileScan parquet default.cstat[time#3262] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time), 
> GreaterThanOrEqual(time,2017-01-02 19:53:51.0), 
> LessThanOrEqual(time,2017-01-09..., ReadSchema: struct
> In Impala both queries run efficiently without any performance difference.
> Spark should be able to parse the date string and convert it to Long/Timestamp
> while generating the optimized logical plan, so that both queries have similar
> performance.






[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly

2017-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896842#comment-15896842
 ] 

Apache Spark commented on SPARK-19145:
--

User 'tanejagagan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17174

> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>
> I have a time series table with a timestamp column.
> The following query
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to
> String before the filter is applied.
> However, in the second query no such casting is performed and the filter uses a
> long value.
> Below is the generated physical plan for the slower execution, followed by the
> physical plan for the faster execution:
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3339L])
>  +- *Project
> +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) 
> >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 
> 19:53:51))
>+- *FileScan parquet default.cstat[time#3314] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: 
> struct
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3287L])
>  +- *Project
> +- *Filter ((isnotnull(time#3262) && (time#3262 >= 
> 148340483100)) && (time#3262 <= 148400963100))
>+- *FileScan parquet default.cstat[time#3262] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time), 
> GreaterThanOrEqual(time,2017-01-02 19:53:51.0), 
> LessThanOrEqual(time,2017-01-09..., ReadSchema: struct
> In Impala both queries run efficiently without any performance difference.
> Spark should be able to parse the date string and convert it to Long/Timestamp
> while generating the optimized logical plan, so that both queries have similar
> performance.






[jira] [Assigned] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19832:


Assignee: (was: Apache Spark)

> DynamicPartitionWriteTask should escape the partition name 
> ---
>
> Key: SPARK-19832
> URL: https://issues.apache.org/jira/browse/SPARK-19832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently in DynamicPartitionWriteTask, when we compute the partitionPath of a
> partition, we only escape the partition value, not the partition name.
> This causes problems for some special partition names, for example:
> 1) If the partition name contains '%' etc., two partition paths are created in
> the filesystem: an escaped path like '/path/a%25b=1' and an unescaped path like
> '/path/a%b=1'.
> The inserted data is stored in the unescaped path, while SHOW PARTITIONS on the
> table returns 'a%25b=1', where the partition name is escaped, so the two are
> inconsistent. I think the data should be stored in the escaped path in the
> filesystem, which is also what Hive 2.0.0 does.
> 2) If the partition name contains ':', new Path("/path", "a:b") throws an
> exception because a colon in a relative path is illegal.
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: a:b
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:171)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:88)
>   ... 48 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 50 more
> {code}
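
A small self-contained sketch of the fix direction described above. The helper below is hypothetical (it is not the actual Spark or Hive code); it only illustrates escaping the partition name as well as the value when building the directory name.

{code}
// Hypothetical helper, for illustration only: escape BOTH the partition column name
// and the partition value with Hive-style percent-encoding before building the
// "name=value" directory, so names containing '%' or ':' map to a single legal path.
object PartitionPathSketch {
  // A subset of the characters Hive escapes in partition path components.
  private val charsToEscape: Set[Char] = Set('%', ':', '/', '#', '?', '\\', '=', '\n')

  private def escape(s: String): String =
    s.flatMap { c =>
      if (charsToEscape.contains(c) || c < ' ') f"%%${c.toInt}%02X" else c.toString
    }

  /** Builds "name=value" with both sides escaped, e.g. ("a%b", "1") -> "a%25b=1". */
  def partitionDir(name: String, value: String): String =
    s"${escape(name)}=${escape(value)}"

  def main(args: Array[String]): Unit = {
    println(partitionDir("a%b", "1")) // a%25b=1  -- one consistent location on disk
    println(partitionDir("a:b", "1")) // a%3Ab=1  -- no longer an illegal relative path
  }
}
{code}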






[jira] [Commented] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name

2017-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896813#comment-15896813
 ] 

Apache Spark commented on SPARK-19832:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/17173

> DynamicPartitionWriteTask should escape the partition name 
> ---
>
> Key: SPARK-19832
> URL: https://issues.apache.org/jira/browse/SPARK-19832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently in DynamicPartitionWriteTask, when we compute the partitionPath of a
> partition, we only escape the partition value, not the partition name.
> This causes problems for some special partition names, for example:
> 1) If the partition name contains '%' etc., two partition paths are created in
> the filesystem: an escaped path like '/path/a%25b=1' and an unescaped path like
> '/path/a%b=1'.
> The inserted data is stored in the unescaped path, while SHOW PARTITIONS on the
> table returns 'a%25b=1', where the partition name is escaped, so the two are
> inconsistent. I think the data should be stored in the escaped path in the
> filesystem, which is also what Hive 2.0.0 does.
> 2) If the partition name contains ':', new Path("/path", "a:b") throws an
> exception because a colon in a relative path is illegal.
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: a:b
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:171)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:88)
>   ... 48 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 50 more
> {code}






[jira] [Assigned] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19832:


Assignee: Apache Spark

> DynamicPartitionWriteTask should escape the partition name 
> ---
>
> Key: SPARK-19832
> URL: https://issues.apache.org/jira/browse/SPARK-19832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Apache Spark
>
> Currently in DynamicPartitionWriteTask, when we compute the partitionPath of a
> partition, we only escape the partition value, not the partition name.
> This causes problems for some special partition names, for example:
> 1) If the partition name contains '%' etc., two partition paths are created in
> the filesystem: an escaped path like '/path/a%25b=1' and an unescaped path like
> '/path/a%b=1'.
> The inserted data is stored in the unescaped path, while SHOW PARTITIONS on the
> table returns 'a%25b=1', where the partition name is escaped, so the two are
> inconsistent. I think the data should be stored in the escaped path in the
> filesystem, which is also what Hive 2.0.0 does.
> 2) If the partition name contains ':', new Path("/path", "a:b") throws an
> exception because a colon in a relative path is illegal.
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: a:b
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:171)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:88)
>   ... 48 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 50 more
> {code}






[jira] [Created] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name

2017-03-05 Thread Song Jun (JIRA)
Song Jun created SPARK-19832:


 Summary: DynamicPartitionWriteTask should escape the partition 
name 
 Key: SPARK-19832
 URL: https://issues.apache.org/jira/browse/SPARK-19832
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently in DynamicPartitionWriteTask, when we compute the partitionPath of a 
partition, we only escape the partition value, not the partition name.

This causes problems for some special partition names, for example:
1) If the partition name contains '%' etc., two partition paths are created in 
the filesystem: an escaped path like '/path/a%25b=1' and an unescaped path like 
'/path/a%b=1'.
The inserted data is stored in the unescaped path, while SHOW PARTITIONS on the 
table returns 'a%25b=1', where the partition name is escaped, so the two are 
inconsistent. I think the data should be stored in the escaped path in the 
filesystem, which is also what Hive 2.0.0 does.

2) If the partition name contains ':', new Path("/path", "a:b") throws an 
exception because a colon in a relative path is illegal.

{code}
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: a:b
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.<init>(Path.java:171)
  at org.apache.hadoop.fs.Path.<init>(Path.java:88)
  ... 48 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.<init>(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 50 more
{code}









[jira] [Commented] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program

2017-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896808#comment-15896808
 ] 

Apache Spark commented on SPARK-19008:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17172

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from 
> Dataset program
> ---
>
> Key: SPARK-19008
> URL: https://issues.apache.org/jira/browse/SPARK-19008
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> In a 
> [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
> between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid
> boxing/unboxing overhead when a Dataset program calls a lambda written in Scala
> that operates on a primitive type.
> In such a case, Catalyst can directly call a primitive-typed method (for
> example {{long apply(long);}}) instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be 
> [here|https://issues.apache.org/jira/browse/SPARK-14083].
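
A standalone sketch of the overhead being discussed, with made-up names (this is not Spark's generated code): it contrasts calling a Scala lambda through the erased Object-typed entry point, which boxes, with calling a primitive-typed method, which does not.

{code}
// Illustration only: why the erased Function1 entry point boxes, and how a
// primitive-typed interface avoids it. All names here are invented for the sketch.
object BoxingSketch {
  // A lambda over a primitive type, as a user would write in a Dataset program.
  val addOne: Long => Long = _ + 1L

  // Calling through the erased signature forces a java.lang.Long box on the way in
  // and on the way out, for every invocation (i.e. for every row).
  def callBoxed(f: AnyRef, arg: AnyRef): AnyRef =
    f.asInstanceOf[Function1[AnyRef, AnyRef]].apply(arg)

  // A primitive-typed interface lets generated code invoke the lambda without boxing.
  trait LongUnary { def apply(x: Long): Long }
  val addOnePrimitive: LongUnary = new LongUnary { def apply(x: Long): Long = x + 1L }

  def main(args: Array[String]): Unit = {
    val boxed   = callBoxed(addOne, java.lang.Long.valueOf(41L)) // allocates boxed Longs
    val unboxed = addOnePrimitive.apply(41L)                     // stays a primitive long
    println((boxed, unboxed))
  }
}
{code}

The proposal above is essentially to let Catalyst generate calls of the second kind when the lambda's input and output types are primitive.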






[jira] [Assigned] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19008:


Assignee: (was: Apache Spark)

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from 
> Dataset program
> ---
>
> Key: SPARK-19008
> URL: https://issues.apache.org/jira/browse/SPARK-19008
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> In a 
> [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
> between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid
> boxing/unboxing overhead when a Dataset program calls a lambda written in Scala
> that operates on a primitive type.
> In such a case, Catalyst can directly call a primitive-typed method (for
> example {{long apply(long);}}) instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be 
> [here|https://issues.apache.org/jira/browse/SPARK-14083].






[jira] [Assigned] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19008:


Assignee: Apache Spark

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from 
> Dataset program
> ---
>
> Key: SPARK-19008
> URL: https://issues.apache.org/jira/browse/SPARK-19008
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> In a 
> [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
> between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid
> boxing/unboxing overhead when a Dataset program calls a lambda written in Scala
> that operates on a primitive type.
> In such a case, Catalyst can directly call a primitive-typed method (for
> example {{long apply(long);}}) instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be 
> [here|https://issues.apache.org/jira/browse/SPARK-14083].






[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages

2017-03-05 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19831:

Description: 
Cleaning up an application may take a long time on the worker, and it then blocks 
the worker from sending heartbeats to the master because the worker extends 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the 
*ApplicationFinished* message, the master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by the master again. So I think 
this is a bug in Spark. This problem could be solved with the following suggestions:

1. It would be better to run the application cleanup in a separate asynchronous 
thread such as 'cleanupThreadExecutor', so that it does not block other RPC 
messages like *SendHeartbeat*;

2. It would be better not to send the heartbeat to the master over the RPC 
channel, because any other RPC message may block that channel. The heartbeat 
should instead be sent from an asynchronous timer thread.

  was:
Cleaning up an application may take a long time on the worker, and it then blocks 
the worker from sending heartbeats to the master because the worker extends 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the 
*ApplicationFinished* message, the master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by the master again. So I think 
this is a bug in Spark. This problem could be solved with the following suggestions:

1. It would be better to run the application cleanup in a separate asynchronous 
thread such as 'cleanupThreadExecutor', so that it does not block other RPC 
messages like SendHeartbeat;

2. It would be better not to send the heartbeat to the master over the RPC 
channel, because any other RPC message may block that channel. The heartbeat 
should instead be sent from an asynchronous timer thread.


> Sending the heartbeat  master from worker  maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: hustfxj
>
> Cleaning up an application may take a long time on the worker, and it then
> blocks the worker from sending heartbeats to the master because the worker
> extends *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by
> the *ApplicationFinished* message, the master will think the worker is dead. If
> the worker has a driver, the driver will be scheduled by the master again. So I
> think this is a bug in Spark. This problem could be solved with the following
> suggestions:
> 1. It would be better to run the application cleanup in a separate asynchronous
> thread such as 'cleanupThreadExecutor', so that it does not block other RPC
> messages like *SendHeartbeat*;
> 2. It would be better not to send the heartbeat to the master over the RPC
> channel, because any other RPC message may block that channel. The heartbeat
> should instead be sent from an asynchronous timer thread.
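
A self-contained sketch of suggestion 1 above, with hypothetical names (this is not the actual Worker code): the slow cleanup is submitted to a dedicated single-thread executor so the message loop returns immediately and later heartbeat messages are not delayed.

{code}
// Illustration of suggestion 1: run the expensive cleanup off the message-handling
// thread. All names below are invented for the sketch.
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object NonBlockingCleanupSketch {
  // One background thread reserved for cleanup, mirroring the 'cleanupThreadExecutor' idea.
  private val cleanupExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

  private def deleteApplicationFiles(appId: String): Unit = {
    Thread.sleep(5000) // stand-in for the expensive directory cleanup
    println(s"cleaned up $appId")
  }

  // Called from the single-threaded message loop; must return quickly.
  def onApplicationFinished(appId: String): Unit = {
    Future(deleteApplicationFiles(appId))(cleanupExecutionContext)
    // returns immediately, so a following SendHeartbeat is handled without delay
  }

  def main(args: Array[String]): Unit = {
    onApplicationFinished("app-20170305-0001")
    println("message loop is free to process SendHeartbeat")
    Thread.sleep(6000) // keep the JVM alive long enough for the background cleanup
    cleanupExecutionContext.shutdown()
  }
}
{code}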






[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages

2017-03-05 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19831:

Description: 
Cleaning up an application may take a long time on the worker, and it then blocks 
the worker from sending heartbeats to the master because the worker extends 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the 
*ApplicationFinished* message, the master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by the master again. So I think 
this is a bug in Spark. This problem could be solved with the following suggestions:

1. It would be better to run the application cleanup in a separate asynchronous 
thread such as 'cleanupThreadExecutor', so that it does not block other RPC 
messages like SendHeartbeat;

2. It would be better not to send the heartbeat to the master over the RPC 
channel, because any other RPC message may block that channel. The heartbeat 
should instead be sent from an asynchronous timer thread.

  was:
Cleaning up an application may take a long time on the worker, and it then blocks 
the worker from sending heartbeats to the master because the worker extends 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the 
*ApplicationFinished* message, the master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by the master again. So I think 
this is a bug in Spark. I can solve this problem with the following suggestions:

1. It would be better to run the application cleanup in a separate asynchronous 
thread such as 'cleanupThreadExecutor', so that it does not block other RPC 
messages like SendHeartbeat;

2. It would be better not to send the heartbeat to the master over the RPC 
channel, because any other RPC message may block that channel. The heartbeat 
should instead be sent from an asynchronous timer thread.


> Sending the heartbeat  master from worker  maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: hustfxj
>
> Cleaning up an application may take a long time on the worker, and it then
> blocks the worker from sending heartbeats to the master because the worker
> extends *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by
> the *ApplicationFinished* message, the master will think the worker is dead. If
> the worker has a driver, the driver will be scheduled by the master again. So I
> think this is a bug in Spark. This problem could be solved with the following
> suggestions:
> 1. It would be better to run the application cleanup in a separate asynchronous
> thread such as 'cleanupThreadExecutor', so that it does not block other RPC
> messages like SendHeartbeat;
> 2. It would be better not to send the heartbeat to the master over the RPC
> channel, because any other RPC message may block that channel. The heartbeat
> should instead be sent from an asynchronous timer thread.






[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages

2017-03-05 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19831:

Description: 
Cleaning up an application may take a long time on the worker, and it then blocks 
the worker from sending heartbeats to the master because the worker extends 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the 
*ApplicationFinished* message, the master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by the master again. So I think 
this is a bug in Spark. I can solve this problem with the following suggestions:

1. It would be better to run the application cleanup in a separate asynchronous 
thread such as 'cleanupThreadExecutor', so that it does not block other RPC 
messages like SendHeartbeat;

2. It would be better not to send the heartbeat to the master over the RPC 
channel, because any other RPC message may block that channel. The heartbeat 
should instead be sent from an asynchronous timer thread.

  was:
Cleaning up an application may take a long time on the worker, and it then blocks 
the worker from sending heartbeats and other RPC messages to the master because 
the worker extends *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is 
blocked by the *ApplicationFinished* message, the master will think the worker is 
dead. If the worker has a driver, the driver will be scheduled by the master 
again. So I think this is a bug in Spark. I can solve this problem with the 
following suggestions:

1. It would be better to run the application cleanup in a separate asynchronous 
thread such as 'cleanupThreadExecutor', so that it does not block other RPC 
messages like SendHeartbeat;

2. It would be better not to send the heartbeat to the master over the RPC 
channel, because any other RPC message may block that channel. The heartbeat 
should instead be sent from an asynchronous timer thread.


> Sending the heartbeat  master from worker  maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: hustfxj
>
> Cleaning up an application may take a long time on the worker, and it then
> blocks the worker from sending heartbeats to the master because the worker
> extends *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by
> the *ApplicationFinished* message, the master will think the worker is dead. If
> the worker has a driver, the driver will be scheduled by the master again. So I
> think this is a bug in Spark. I can solve this problem with the following
> suggestions:
> 1. It would be better to run the application cleanup in a separate asynchronous
> thread such as 'cleanupThreadExecutor', so that it does not block other RPC
> messages like SendHeartbeat;
> 2. It would be better not to send the heartbeat to the master over the RPC
> channel, because any other RPC message may block that channel. The heartbeat
> should instead be sent from an asynchronous timer thread.






[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages

2017-03-05 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19831:

Description: 
Cleaning up an application may take a long time on the worker, and it then blocks 
the worker from sending heartbeats and other RPC messages to the master because 
the worker extends *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is 
blocked by the *ApplicationFinished* message, the master will think the worker is 
dead. If the worker has a driver, the driver will be scheduled by the master 
again. So I think this is a bug in Spark. I can solve this problem with the 
following suggestions:

1. It would be better to run the application cleanup in a separate asynchronous 
thread such as 'cleanupThreadExecutor', so that it does not block other RPC 
messages like SendHeartbeat;

2. It would be better not to send the heartbeat to the master over the RPC 
channel, because any other RPC message may block that channel. The heartbeat 
should instead be sent from an asynchronous timer thread.

  was:
Cleaning up an application may take a long time on the worker, and it then blocks 
the worker from sending heartbeats and other RPC messages to the master because 
the worker extends *ThreadSafeRpcEndpoint*. So the master will think the worker 
is dead. If the worker has a driver, the driver will be scheduled by the master 
again. So I think this is a bug in Spark. I can solve this problem with the 
following suggestions:

1. It would be better to run the application cleanup in a separate asynchronous 
thread such as 'cleanupThreadExecutor', so that it does not block other RPC 
messages like SendHeartbeat;

2. It would be better not to send the heartbeat to the master over the RPC 
channel, because any other RPC message may block that channel. The heartbeat 
should instead be sent from an asynchronous timer thread.


> Sending the heartbeat  master from worker  maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: hustfxj
>
> Cleaning up an application may take a long time on the worker, and it then
> blocks the worker from sending heartbeats and other RPC messages to the master
> because the worker extends *ThreadSafeRpcEndpoint*. If the heartbeat from a
> worker is blocked by the *ApplicationFinished* message, the master will think
> the worker is dead. If the worker has a driver, the driver will be scheduled by
> the master again. So I think this is a bug in Spark. I can solve this problem
> with the following suggestions:
> 1. It would be better to run the application cleanup in a separate asynchronous
> thread such as 'cleanupThreadExecutor', so that it does not block other RPC
> messages like SendHeartbeat;
> 2. It would be better not to send the heartbeat to the master over the RPC
> channel, because any other RPC message may block that channel. The heartbeat
> should instead be sent from an asynchronous timer thread.






[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages

2017-03-05 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19831:

Summary: Sending the heartbeat  master from worker  maybe blocked by other 
rpc messages  (was: Sending the heartbeat to master maybe blocked by other rpc 
messages)

> Sending the heartbeat  master from worker  maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: hustfxj
>
> Cleaning up an application may take a long time on the worker, and it then
> blocks the worker from sending heartbeats and other RPC messages to the master
> because the worker extends *ThreadSafeRpcEndpoint*. So the master will think
> the worker is dead. If the worker has a driver, the driver will be scheduled by
> the master again. So I think this is a bug in Spark. I can solve this problem
> with the following suggestions:
> 1. It would be better to run the application cleanup in a separate asynchronous
> thread such as 'cleanupThreadExecutor', so that it does not block other RPC
> messages like SendHeartbeat;
> 2. It would be better not to send the heartbeat to the master over the RPC
> channel, because any other RPC message may block that channel. The heartbeat
> should instead be sent from an asynchronous timer thread.






[jira] [Created] (SPARK-19831) Sending the heartbeat to master maybe blocked by other rpc messages

2017-03-05 Thread hustfxj (JIRA)
hustfxj created SPARK-19831:
---

 Summary: Sending the heartbeat to master maybe blocked by other 
rpc messages
 Key: SPARK-19831
 URL: https://issues.apache.org/jira/browse/SPARK-19831
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: hustfxj


Cleaning up an application may take a long time on the worker, and it then blocks 
the worker from sending heartbeats and other RPC messages to the master because 
the worker extends *ThreadSafeRpcEndpoint*. So the master will think the worker 
is dead. If the worker has a driver, the driver will be scheduled by the master 
again. So I think this is a bug in Spark. I can solve this problem with the 
following suggestions:

1. It would be better to run the application cleanup in a separate asynchronous 
thread such as 'cleanupThreadExecutor', so that it does not block other RPC 
messages like SendHeartbeat;

2. It would be better not to send the heartbeat to the master over the RPC 
channel, because any other RPC message may block that channel. The heartbeat 
should instead be sent from an asynchronous timer thread.






[jira] [Commented] (SPARK-19578) Poor pyspark performance + incorrect UI input-size metrics

2017-03-05 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896751#comment-15896751
 ] 

holdenk commented on SPARK-19578:
-

[~nchammas] That sounds like a pretty good summary from my point of view. Of 
course it's possible there are some other performance wins we could find (and it 
might be worth thinking about), but I think our current plan is to focus on 
improving DataFrame performance for PySpark.

> Poor pyspark performance + incorrect UI input-size metrics
> --
>
> Key: SPARK-19578
> URL: https://issues.apache.org/jira/browse/SPARK-19578
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 1.6.1, 1.6.2, 2.0.1
> Environment: Spark 1.6.2 Hortonworks
> Spark 2.0.1 MapR
> Spark 1.6.1 MapR
>Reporter: Artur Sukhenko
> Attachments: pyspark_incorrect_inputsize.png, reproduce_log, 
> spark_shell_correct_inputsize.png
>
>
> A simple job in pyspark takes 14 minutes to complete.
> The text file used to reproduce the issue contains multiple millions of lines
> of the single word "yes"
> (this might be the cause of the poor performance).
> {code}
> var a = sc.textFile("/tmp/yes.txt")
> a.count()
> {code}
> Same code took 33 sec in spark-shell
> Reproduce steps:
> Run this to generate a big file (press Ctrl+C after 5-6 seconds):
> [spark@c6401 ~]$ yes > /tmp/yes.txt
> [spark@c6401 ~]$ ll /tmp/
> -rw-r--r--  1 spark hadoop  516079616 Feb 13 11:10 yes.txt
> [spark@c6401 ~]$ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/
> [spark@c6401 ~]$ pyspark
> {code}
> Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
> 17/02/13 11:12:36 INFO SparkContext: Running Spark version 1.6.2
> {code}
> >>> a = sc.textFile("/tmp/yes.txt")
> {code}
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 341.1 KB, free 341.1 KB)
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 28.3 KB, free 369.4 KB)
> 17/02/13 11:12:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on localhost:43389 (size: 28.3 KB, free: 517.4 MB)
> 17/02/13 11:12:58 INFO SparkContext: Created broadcast 0 from textFile at 
> NativeMethodAccessorImpl.java:-2
> {code}
> >>> a.count()
> {code}
> 17/02/13 11:13:03 INFO FileInputFormat: Total input paths to process : 1
> 17/02/13 11:13:03 INFO SparkContext: Starting job: count at :1
> 17/02/13 11:13:03 INFO DAGScheduler: Got job 0 (count at :1) with 4 
> output partitions
> 17/02/13 11:13:03 INFO DAGScheduler: Final stage: ResultStage 0 (count at 
> :1)
> 17/02/13 11:13:03 INFO DAGScheduler: Parents of final stage: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Missing parents: List()
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] 
> at count at :1), which has no missing parents
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 5.7 KB, free 375.1 KB)
> 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
> in memory (estimated size 3.5 KB, free 378.6 KB)
> 17/02/13 11:13:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
> on localhost:43389 (size: 3.5 KB, free: 517.4 MB)
> 17/02/13 11:13:03 INFO SparkContext: Created broadcast 1 from broadcast at 
> DAGScheduler.scala:1008
> 17/02/13 11:13:03 INFO DAGScheduler: Submitting 4 missing tasks from 
> ResultStage 0 (PythonRDD[2] at count at :1)
> 17/02/13 11:13:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
> 17/02/13 11:13:03 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, partition 0,ANY, 2149 bytes)
> 17/02/13 11:13:03 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 17/02/13 11:13:03 INFO HadoopRDD: Input split: 
> hdfs://c6401.ambari.apache.org:8020/tmp/yes.txt:0+134217728
> 17/02/13 11:13:03 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
> mapreduce.task.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.id is deprecated. Instead, 
> use mapreduce.task.attempt.id
> 17/02/13 11:13:03 INFO deprecation: mapred.task.is.map is deprecated. 
> Instead, use mapreduce.task.ismap
> 17/02/13 11:13:03 INFO deprecation: mapred.task.partition is deprecated. 
> Instead, use mapreduce.task.partition
> 17/02/13 11:13:03 INFO deprecation: mapred.job.id is deprecated. Instead, use 
> mapreduce.job.id
> 17/02/13 11:16:37 INFO PythonRunner: Times: total = 213335, boot = 317, init 
> = 445, finish = 212573
> 17/02/13 11:16:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2182 
> bytes result sent to driver
> 17/02/13 11:16:37 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 
> localhost, partition 1,ANY, 2149 bytes)
> 17/02/13 11:16:37 INFO Executor: Running task 1.0 in stage 0.0 

[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)

2017-03-05 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896741#comment-15896741
 ] 

Tathagata Das commented on SPARK-19067:
---

Hey [~amitsela]
Apologies for not noticing this comment earlier. I am in the process of adding 
timeouts to mapGroupsWithState. Will update the JIRA soon. Let me know how much 
that helps.


> mapGroupsWithState - arbitrary stateful operations with Structured Streaming 
> (similar to DStream.mapWithState)
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
>
> Right now the only way to do stateful operations is with Aggregator or
> UDAF.  However, this does not give users control over emission or expiration of
> state, making it hard to implement things like sessionization.  We should add
> a more general construct (probably similar to {{DStream.mapWithState}}) to
> structured streaming. Here is the design.
> *Requirements*
> - Users should be able to specify a function that can do the following
> - Access the input row corresponding to a key
> - Access the previous state corresponding to a key
> - Optionally, update or remove the state
> - Output any number of new rows (or none at all)
> *Proposed API*
> {code}
> //  New methods on KeyValueGroupedDataset 
> class KeyValueGroupedDataset[K, V] {  
>   // Scala friendly
>   def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], 
> State[S]) => U)
> def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, 
> Iterator[V], State[S]) => Iterator[U])
>   // Java friendly
>def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, 
> R], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
>def flatMapGroupsWithState[S, U](func: 
> FlatMapGroupsWithStateFunction[K, V, S, R], stateEncoder: Encoder[S], 
> resultEncoder: Encoder[U])
> }
> // --- New Java-friendly function classes --- 
> public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   R call(K key, Iterator<V> values, state: State<S>) throws Exception;
> }
> public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends
> Serializable {
>   Iterator<R> call(K key, Iterator<V> values, state: State<S>) throws
> Exception;
> }
> // -- Wrapper class for state data -- 
> trait KeyedState[S] {
>   def exists(): Boolean   
>   def get(): S  // throws Exception if state does not exist
>   def getOption(): Option[S]   
>   def update(newState: S): Unit
>   def remove(): Unit  // exists() will be false after this
> }
> {code}
> Key Semantics of the State class
> - The state can be null.
> - If the state.remove() is called, then state.exists() will return false, and 
> getOption will return None.
> - After state.update(newState) is called, state.exists() will 
> return true, and getOption will return Some(...). 
> - None of the operations are thread-safe. This is to avoid memory barriers.
> *Usage*
> {code}
> val stateFunc = (word: String, words: Iterator[String], runningCount: 
> KeyedState[Long]) => {
> val newCount = words.size + runningCount.getOption.getOrElse(0L)
> runningCount.update(newCount)
>(word, newCount)
> }
> dataset   // type 
> is Dataset[String]
>   .groupByKey[String](w => w) // generates 
> KeyValueGroupedDataset[String, String]
>   .mapGroupsWithState[Long, (String, Long)](stateFunc)// returns 
> Dataset[(String, Long)]
> {code}
> *Future Directions*
> - Timeout based state expiration (that has not received data for a while)
> - General expression based expiration 






[jira] [Assigned] (SPARK-19830) Add parseTableSchema API to ParserInterface

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19830:


Assignee: Apache Spark  (was: Xiao Li)

> Add parseTableSchema API to ParserInterface
> ---
>
> Key: SPARK-19830
> URL: https://issues.apache.org/jira/browse/SPARK-19830
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Specifying the table schema in DDL format is needed in different scenarios,
> for example specifying the schema in the SQL function {{from_json}} and
> specifying customized JDBC data types. In the submitted PRs, the idea is to ask
> users to specify the table schema in JSON format, which is not user friendly.
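
A short sketch of the motivation, not the final API: the DDL-style string shown is an assumption about the target syntax, while the StructType/JSON part uses the existing Spark API.

{code}
// Contrast between what users must build today (a StructType, or its JSON form)
// and the compact DDL-style string the proposal wants to accept.
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SchemaFormatsSketch {
  def main(args: Array[String]): Unit = {
    // Today: construct the schema programmatically, or pass its verbose JSON form.
    val schema = StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", StringType)))
    println(schema.json) // {"type":"struct","fields":[...]} -- hard to write by hand

    // What a parseTableSchema-style API would accept instead (illustrative only):
    val ddl = "a INT, b STRING"
    println(ddl)
  }
}
{code}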






[jira] [Assigned] (SPARK-19830) Add parseTableSchema API to ParserInterface

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19830:


Assignee: Xiao Li  (was: Apache Spark)

> Add parseTableSchema API to ParserInterface
> ---
>
> Key: SPARK-19830
> URL: https://issues.apache.org/jira/browse/SPARK-19830
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Specifying the table schema in DDL format is needed in different scenarios,
> for example specifying the schema in the SQL function {{from_json}} and
> specifying customized JDBC data types. In the submitted PRs, the idea is to ask
> users to specify the table schema in JSON format, which is not user friendly.






[jira] [Commented] (SPARK-19830) Add parseTableSchema API to ParserInterface

2017-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896736#comment-15896736
 ] 

Apache Spark commented on SPARK-19830:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17171

> Add parseTableSchema API to ParserInterface
> ---
>
> Key: SPARK-19830
> URL: https://issues.apache.org/jira/browse/SPARK-19830
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Specifying the table schema in DDL format is needed in different scenarios,
> for example specifying the schema in the SQL function {{from_json}} and
> specifying customized JDBC data types. In the submitted PRs, the idea is to ask
> users to specify the table schema in JSON format, which is not user friendly.






[jira] [Updated] (SPARK-19830) Add parseTableSchema API to ParserInterface

2017-03-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19830:

Description: 
Specifying the table schema in DDL format is needed in different scenarios, for 
example specifying the schema in the SQL function {{from_json}} and specifying 
customized JDBC data types. In the submitted PRs, the idea is to ask users to 
specify the table schema in JSON format, which is not user friendly.


  was:
Specifying the table schema in DDL format is needed in different scenarios, for 
example specifying the schema in the SQL function {from_json} and specifying 
customized JDBC data types. In the submitted PRs, the idea is to ask users to 
specify the table schema in JSON format, which is not user friendly.



> Add parseTableSchema API to ParserInterface
> ---
>
> Key: SPARK-19830
> URL: https://issues.apache.org/jira/browse/SPARK-19830
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Specifying the table schema in DDL format is needed in different scenarios,
> for example specifying the schema in the SQL function {{from_json}} and
> specifying customized JDBC data types. In the submitted PRs, the idea is to ask
> users to specify the table schema in JSON format, which is not user friendly.






[jira] [Created] (SPARK-19830) Add parseTableSchema API to ParserInterface

2017-03-05 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19830:
---

 Summary: Add parseTableSchema API to ParserInterface
 Key: SPARK-19830
 URL: https://issues.apache.org/jira/browse/SPARK-19830
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li


Specifying the table schema in DDL format is needed in different scenarios, for 
example specifying the schema in the SQL function {from_json} and specifying 
customized JDBC data types. In the submitted PRs, the idea is to ask users to 
specify the table schema in JSON format, which is not user friendly.







[jira] [Closed] (SPARK-19815) Not orderable should be applied to right key instead of left key

2017-03-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-19815.
---
Resolution: Won't Fix

> Not orderable should be applied to right key instead of left key
> 
>
> Key: SPARK-19815
> URL: https://issues.apache.org/jira/browse/SPARK-19815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> When generating ShuffledHashJoinExec, the orderable condition should be 
> applied to the right key instead of the left key.






[jira] [Updated] (SPARK-19829) The log about driver should support rolling like executor

2017-03-05 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19829:

Description: 
We should roll the driver's log, otherwise the log may grow very large.

{code:title=DriverRunner.java|borderStyle=solid}
// modify the runDriver
  private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: 
Boolean): Int = {
builder.directory(baseDir)
def initialize(process: Process): Unit = {
  // Redirect stdout and stderr to files-- the old code
//  val stdout = new File(baseDir, "stdout")
//  CommandUtils.redirectStream(process.getInputStream, stdout)
//
//  val stderr = new File(baseDir, "stderr")
//  val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", 
"\"")
//  val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" 
* 40)
//  Files.append(header, stderr, StandardCharsets.UTF_8)
//  CommandUtils.redirectStream(process.getErrorStream, stderr)

  // Redirect its stdout and stderr to files-support rolling
  val stdout = new File(baseDir, "stdout")
  stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
  val stderr = new File(baseDir, "stderr")
  val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", 
"\"")
  val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 
40)
  Files.append(header, stderr, StandardCharsets.UTF_8)
  stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
}
runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
  }
{code}



  was:
We should rollback the log of the driver , or the log maybe large!!! 

{code:title=Bar.java|borderStyle=solid}
// modify the runDriver
  private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: 
Boolean): Int = {
builder.directory(baseDir)
def initialize(process: Process): Unit = {
  // Redirect stdout and stderr to files-- the old code
//  val stdout = new File(baseDir, "stdout")
//  CommandUtils.redirectStream(process.getInputStream, stdout)
//
//  val stderr = new File(baseDir, "stderr")
//  val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", 
"\"")
//  val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" 
* 40)
//  Files.append(header, stderr, StandardCharsets.UTF_8)
//  CommandUtils.redirectStream(process.getErrorStream, stderr)

  // Redirect its stdout and stderr to files-support rolling
  val stdout = new File(baseDir, "stdout")
  stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
  val stderr = new File(baseDir, "stderr")
  val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", 
"\"")
  val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 
40)
  Files.append(header, stderr, StandardCharsets.UTF_8)
  stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
}
runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
  }
{code}




> The log about driver should support rolling like executor
> -
>
> Key: SPARK-19829
> URL: https://issues.apache.org/jira/browse/SPARK-19829
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: hustfxj
>
> We should rollback the log of the driver , or the log maybe large!!! 
> {code:title=DriverRunner.java|borderStyle=solid}
> // modify the runDriver
>   private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: 
> Boolean): Int = {
> builder.directory(baseDir)
> def initialize(process: Process): Unit = {
>   // Redirect stdout and stderr to files-- the old code
> //  val stdout = new File(baseDir, "stdout")
> //  CommandUtils.redirectStream(process.getInputStream, stdout)
> //
> //  val stderr = new File(baseDir, "stderr")
> //  val formattedCommand = builder.command.asScala.mkString("\"", "\" 
> \"", "\"")
> //  val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, 
> "=" * 40)
> //  Files.append(header, stderr, StandardCharsets.UTF_8)
> //  CommandUtils.redirectStream(process.getErrorStream, stderr)
>   // Redirect its stdout and stderr to files-support rolling
>   val stdout = new File(baseDir, "stdout")
>   stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
>   val stderr = new File(baseDir, "stderr")
>   val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", 
> "\"")
>   val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" 
> * 40)
>   Files.append(header, stderr, StandardCharsets.UTF_8)
>   stderrAppender = 

[jira] [Created] (SPARK-19829) The log about driver should support rolling like executor

2017-03-05 Thread hustfxj (JIRA)
hustfxj created SPARK-19829:
---

 Summary: The log about driver should support rolling like executor
 Key: SPARK-19829
 URL: https://issues.apache.org/jira/browse/SPARK-19829
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: hustfxj


We should roll the driver's log as well; otherwise it can grow very large. 

{code:title=DriverRunner.scala|borderStyle=solid}
// modify runDriver
  private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
    builder.directory(baseDir)
    def initialize(process: Process): Unit = {
      // Old code: redirect stdout and stderr straight to files
      //   val stdout = new File(baseDir, "stdout")
      //   CommandUtils.redirectStream(process.getInputStream, stdout)
      //
      //   val stderr = new File(baseDir, "stderr")
      //   val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
      //   val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
      //   Files.append(header, stderr, StandardCharsets.UTF_8)
      //   CommandUtils.redirectStream(process.getErrorStream, stderr)

      // New code: redirect stdout and stderr through FileAppender so they can roll
      val stdout = new File(baseDir, "stdout")
      stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
      val stderr = new File(baseDir, "stderr")
      val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
      val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
      Files.append(header, stderr, StandardCharsets.UTF_8)
      stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
    }
    runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
  }
{code}
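
For reference, executors already roll their logs via Spark's FileAppender based on the 
settings below; whether and how these exact properties would carry over to the driver is 
an assumption of this sketch, not something decided in this issue:

{code}
// Existing executor log-rolling settings that the rolling FileAppender honours today;
// the proposal would need equivalent behaviour for the driver's stdout/stderr.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.logs.rolling.strategy", "size")      // or "time"
  .set("spark.executor.logs.rolling.maxSize", "134217728")  // roll after 128 MB
  .set("spark.executor.logs.rolling.maxRetainedFiles", "5") // keep the 5 most recent files
{code}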





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19825) spark.ml R API for FPGrowth

2017-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896665#comment-15896665
 ] 

Apache Spark commented on SPARK-19825:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17170

> spark.ml R API for FPGrowth
> ---
>
> Key: SPARK-19825
> URL: https://issues.apache.org/jira/browse/SPARK-19825
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19825) spark.ml R API for FPGrowth

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19825:


Assignee: Apache Spark

> spark.ml R API for FPGrowth
> ---
>
> Key: SPARK-19825
> URL: https://issues.apache.org/jira/browse/SPARK-19825
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19825) spark.ml R API for FPGrowth

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19825:


Assignee: (was: Apache Spark)

> spark.ml R API for FPGrowth
> ---
>
> Key: SPARK-19825
> URL: https://issues.apache.org/jira/browse/SPARK-19825
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2017-03-05 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896652#comment-15896652
 ] 

DjvuLee commented on SPARK-18085:
-

"A separate jar file" means we generate a new jar file for the history 
function, just like Spark put `network` function in a new jar file, not in the 
Spark-core,I just want do not impact the existing jar file。

You can update the information when this design is ready for complete, maybe 
many people want to try it.

Thanks for your reply!


> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19822) CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.

2017-03-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19822.
--
   Resolution: Fixed
 Assignee: Genmao Yu
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> CheckpointSuite.testCheckpointedOperation: should not check 
> checkpointFilesOfLatestTime by the PATH string.
> ---
>
> Key: SPARK-19822
> URL: https://issues.apache.org/jira/browse/SPARK-19822
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Genmao Yu
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19701) the `in` operator in pyspark is broken

2017-03-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19701:
---

Assignee: Hyukjin Kwon

> the `in` operator in pyspark is broken
> --
>
> Key: SPARK-19701
> URL: https://issues.apache.org/jira/browse/SPARK-19701
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, 
> in __nonzero__
> raise ValueError("Cannot convert column into bool: please use '&' for 
> 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' 
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}
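
For context, Python coerces the result of the `in` operator to a boolean, and a Column 
cannot be converted to a boolean, hence the error above. A minimal sketch of the usual 
alternative, {{Column.contains}}, written here in Scala (PySpark exposes the same method), 
using the file path from the report:

{code}
// Sketch of the working alternative: use Column.contains instead of the `in` operator.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
val linesWithSpark = textFile.filter($"value".contains("Spark"))
{code}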



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19701) the `in` operator in pyspark is broken

2017-03-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19701.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17160
[https://github.com/apache/spark/pull/17160]

> the `in` operator in pyspark is broken
> --
>
> Key: SPARK-19701
> URL: https://issues.apache.org/jira/browse/SPARK-19701
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
> Fix For: 2.2.0
>
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, 
> in __nonzero__
> raise ValueError("Cannot convert column into bool: please use '&' for 
> 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' 
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19535) ALSModel recommendAll analogs

2017-03-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19535.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17090
[https://github.com/apache/spark/pull/17090]

> ALSModel recommendAll analogs
> -
>
> Key: SPARK-19535
> URL: https://issues.apache.org/jira/browse/SPARK-19535
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Sue Ann Hong
> Fix For: 2.2.0
>
>
> Add methods analogous to the spark.mllib MatrixFactorizationModel methods 
> recommendProductsForUsers/UsersForProducts.
> The initial implementation should be very simple, using DataFrame joins.  
> Future work can add optimizations.
> I recommend naming them:
> * recommendForAllUsers
> * recommendForAllItems
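
A short usage sketch of what the proposed methods could look like on a fitted model; the 
column names, the toy data, and the numeric argument are made up for illustration:

{code}
// Hypothetical usage sketch (spark-shell style); column names and data are made up.
import org.apache.spark.ml.recommendation.ALS

val ratings = Seq((0, 0, 4.0f), (0, 1, 2.0f), (1, 1, 3.0f)).toDF("user", "item", "rating")

val model = new ALS()
  .setUserCol("user").setItemCol("item").setRatingCol("rating")
  .fit(ratings)

val topItemsPerUser = model.recommendForAllUsers(10)  // top 10 items for every user
val topUsersPerItem = model.recommendForAllItems(10)  // top 10 users for every item
{code}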



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19828) R to support JSON array in column from_json

2017-03-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19828:
-
Summary: R to support JSON array in column from_json  (was: R support JSON 
array in column from_json)

> R to support JSON array in column from_json
> ---
>
> Key: SPARK-19828
> URL: https://issues.apache.org/jira/browse/SPARK-19828
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19828) R support JSON array in column from_json

2017-03-05 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896622#comment-15896622
 ] 

Felix Cheung commented on SPARK-19828:
--

see SPARK-19595

> R support JSON array in column from_json
> 
>
> Key: SPARK-19828
> URL: https://issues.apache.org/jira/browse/SPARK-19828
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19828) R support JSON array in column from_json

2017-03-05 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19828:


 Summary: R support JSON array in column from_json
 Key: SPARK-19828
 URL: https://issues.apache.org/jira/browse/SPARK-19828
 Project: Spark
  Issue Type: Bug
  Components: SparkR, SQL
Affects Versions: 2.2.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2017-03-05 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896619#comment-15896619
 ] 

Marcelo Vanzin commented on SPARK-18085:


Not sure what you mean by "a separate jar file". There's code that I linked to 
if you want to try. But it's not complete, probably has tons of bugs, and it's 
nowhere near usable at the moment.

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table

2017-03-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19765:

Summary: UNCACHE TABLE should also un-cache all cached plans that refer to 
this table  (was: UNCACHE TABLE should also re-cache all cached plans that 
refer to this table)

> UNCACHE TABLE should also un-cache all cached plans that refer to this table
> 
>
> Key: SPARK-19765
> URL: https://issues.apache.org/jira/browse/SPARK-19765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
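
A small sketch of the scenario the new summary describes; the table name, the cached 
query, and the column are made up for illustration:

{code}
// Illustration of the intended behaviour; t, v, and x are made-up names.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

spark.sql("CACHE TABLE t")
spark.sql("CACHE TABLE v AS SELECT * FROM t WHERE x > 0")  // a cached plan that refers to t
spark.sql("UNCACHE TABLE t")
// With this change, the cached plan for v should be dropped from the cache as well,
// since it refers to t.
{code}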




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896590#comment-15896590
 ] 

Nira Amit commented on SPARK-19656:
---

I will not, but please consider documenting the correct way to work with the 
newAPIHadoopFile in Java. It is not as easy as working with it in Scala and 
I've been googling this enough to know that it is not clear to many Java 
developers who try to use it.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass> {};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat<MyCustomAvroKey, NullWritable> {
> @Override
> public RecordReader<MyCustomAvroKey, NullWritable> 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-19656.
-

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19595) from_json produces only a single row when input is a json array

2017-03-05 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-19595.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/16929

> from_json produces only a single row when input is a json array
> ---
>
> Key: SPARK-19595
> URL: https://issues.apache.org/jira/browse/SPARK-19595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently, {{from_json}} reads a single row when it is a json array. For 
> example,
> {code}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> val schema = StructType(StructField("a", IntegerType) :: Nil)
> Seq(("""[{"a": 1}, {"a": 
> 2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show()
> ++
> |jsontostruct(struct)|
> ++
> | [1]|
> ++
> {code}
> Maybe we should not support this in that function or it should work like a 
> generator expression.
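
For reference, a sketch of what parsing the whole array could look like after the linked 
PR; using an {{ArrayType}} schema here is an assumption based on that PR, not something 
stated in this thread:

{code}
// Sketch only: parse a JSON array into an array of structs (assumes ArrayType support).
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val elementSchema = StructType(StructField("a", IntegerType) :: Nil)
val df = Seq("""[{"a": 1}, {"a": 2}]""").toDF("struct")
df.select(from_json(col("struct"), ArrayType(elementSchema))).show()
// Expected to yield the full array, e.g. [[1], [2]], instead of only the first element.
{code}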



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19595) from_json produces only a single row when input is a json array

2017-03-05 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-19595:
---

Assignee: Hyukjin Kwon

> from_json produces only a single row when input is a json array
> ---
>
> Key: SPARK-19595
> URL: https://issues.apache.org/jira/browse/SPARK-19595
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently, {{from_json}} reads a single row when it is a json array. For 
> example,
> {code}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> val schema = StructType(StructField("a", IntegerType) :: Nil)
> Seq(("""[{"a": 1}, {"a": 
> 2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show()
> ++
> |jsontostruct(struct)|
> ++
> | [1]|
> ++
> {code}
> Maybe we should not support this in that function or it should work like a 
> generator expression.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-19656:
---

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19656.
---
Resolution: Not A Problem

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19656.
---
Resolution: Fixed

I do not see anything surprising given your description. Please don't reopen 
this.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-19656.
-

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896581#comment-15896581
 ] 

Nira Amit commented on SPARK-19656:
---

I found a problem in my schema and managed to load my custom type. So the 
answer to my original question is basically no, there is nothing like 
{code}
ctx.hadoopFile("/path/to/the/avro/file.avro",
  classOf[AvroInputFormat[MyClassInAvroFile]],
  classOf[AvroWrapper[MyClassInAvroFile]],
  classOf[NullWritable])
{code}
for loading custom types into RDDs with the Java API. We have to create all the 
wrapper classes and implement our own RecordReader.

I think this should be documented somewhere.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nira Amit updated SPARK-19656:
--
Attachment: (was: datum2.png)

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nira Amit updated SPARK-19656:
--
Comment: was deleted

(was: {code}
public static class ABgoEventAvroReader extends 
AvroRecordReaderBase {
static Schema schema;
static {
try {
schema = new 
Schema.Parser().parse(AvroTest.class.getResourceAsStream("/Abgo.avsc"));
} catch (IOException e) {
throw new RuntimeException(e);
}
}

/** A reusable object to hold records of the Avro container file. */
private final ABgoEventAvroKey mCurrentRecord;


public ABgoEventAvroReader() {
super(schema);
mCurrentRecord = new ABgoEventAvroKey();
}

/** {@inheritDoc} */
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
boolean hasNext = super.nextKeyValue();
mCurrentRecord.datum(getCurrentRecord());
return hasNext;
}

/** {@inheritDoc} */
@Override
public ABgoEventAvroKey getCurrentKey() throws IOException, 
InterruptedException {
return mCurrentRecord;
}

/** {@inheritDoc} */
@Override
public NullWritable getCurrentValue() throws IOException, 
InterruptedException {
return NullWritable.get();
}
}

{code})

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nira Amit updated SPARK-19656:
--
Attachment: (was: datum.png)

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nira Amit updated SPARK-19656:
--
Comment: was deleted

(was: [~srowen] Will you at least consider the possibility that I'm on to a 
real problem here? I may be wrong of course, but I'm telling you that I've been 
struggling with this for weeks and that other developers are struggling with 
the same thing. At the very least it's worthwhile clarifying this issue. I'm 
attaching a screenshot of what my code returns in runtime, it's definitely the 
right class. So the problem is NOT with my file reader. My reader extends  
AvroRecordReaderBase.)

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
> Attachments: datum2.png, datum.png
>
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19795) R should support column functions to_json, from_json

2017-03-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19795.
--
  Resolution: Fixed
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> R should support column functions to_json, from_json
> 
>
> Key: SPARK-19795
> URL: https://issues.apache.org/jira/browse/SPARK-19795
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.0
>
>
> Particularly since R does not come with built-in support for processing JSON



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19827) spark.ml R API for PIC

2017-03-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19827:
-
Shepherd: Felix Cheung

> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19825) spark.ml R API for FPGrowth

2017-03-05 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19825:
-
Shepherd: Felix Cheung

> spark.ml R API for FPGrowth
> ---
>
> Key: SPARK-19825
> URL: https://issues.apache.org/jira/browse/SPARK-19825
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19827) spark.ml R API for PIC

2017-03-05 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19827:


 Summary: spark.ml R API for PIC
 Key: SPARK-19827
 URL: https://issues.apache.org/jira/browse/SPARK-19827
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19826) spark.ml Python API for PIC

2017-03-05 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19826:


 Summary: spark.ml Python API for PIC
 Key: SPARK-19826
 URL: https://issues.apache.org/jira/browse/SPARK-19826
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 2.1.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19825) spark.ml R API for FPGrowth

2017-03-05 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19825:


 Summary: spark.ml R API for FPGrowth
 Key: SPARK-19825
 URL: https://issues.apache.org/jira/browse/SPARK-19825
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nira Amit updated SPARK-19656:
--
Attachment: datum2.png

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
> Attachments: datum2.png, datum.png
>
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nira Amit updated SPARK-19656:
--
Attachment: datum.png

{code}
public static class ABgoEventAvroReader extends 
AvroRecordReaderBase {
static Schema schema;
static {
try {
schema = new 
Schema.Parser().parse(AvroTest.class.getResourceAsStream("/Abgo.avsc"));
} catch (IOException e) {
throw new RuntimeException(e);
}
}

/** A reusable object to hold records of the Avro container file. */
private final ABgoEventAvroKey mCurrentRecord;


public ABgoEventAvroReader() {
super(schema);
mCurrentRecord = new ABgoEventAvroKey();
}

/** {@inheritDoc} */
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
boolean hasNext = super.nextKeyValue();
mCurrentRecord.datum(getCurrentRecord());
return hasNext;
}

/** {@inheritDoc} */
@Override
public ABgoEventAvroKey getCurrentKey() throws IOException, 
InterruptedException {
return mCurrentRecord;
}

/** {@inheritDoc} */
@Override
public NullWritable getCurrentValue() throws IOException, 
InterruptedException {
return NullWritable.get();
}
}

{code}

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
> Attachments: datum.png
>
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nira Amit reopened SPARK-19656:
---

[~srowen] Will you at least consider the possibility that I'm on to a real 
problem here? I may be wrong of course, but I'm telling you that I've been 
struggling with this for weeks and that other developers are struggling with 
the same thing. At the very least it's worthwhile clarifying this issue. I'm 
attaching a screenshot of what my code returns at runtime; it's definitely the 
right class. So the problem is NOT with my file reader. My reader extends 
AvroRecordReaderBase.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2017-03-05 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896452#comment-15896452
 ] 

Kazuaki Ishizaki commented on SPARK-14083:
--

Is anyone moving forward with this? If not, I will continue working on it.
Recently, I noticed that Spark already uses the ASM framework, which provides 
features similar to Javassist.
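
For reference, a minimal, hypothetical sketch of the kind of inspection ASM enables 
(this is plain ASM, not Spark's shaded copy, and not an actual analyzer): it simply 
lists the methods of a compiled closure class, which is the starting point for 
turning its bytecode into expressions.

{code}
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class ClosureInspector {
    // Prints the name and descriptor of every method of the given class.
    // For pre-2.12 Scala closures these are $anonfun classes on the classpath.
    public static void printMethods(Class<?> closureClass) throws java.io.IOException {
        ClassReader reader = new ClassReader(closureClass.getName());
        reader.accept(new ClassVisitor(Opcodes.ASM5) {
            @Override
            public MethodVisitor visitMethod(int access, String name, String desc,
                                             String signature, String[] exceptions) {
                System.out.println(name + " " + desc);
                return null;  // listing only; a real analyzer would return a MethodVisitor
            }
        }, 0);
    }
}
{code}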

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note the big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.
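
For illustration only, a small self-contained Java program contrasting the two 
styles from the description: the closure is opaque to Catalyst, while the 
equivalent column expression is not (the column names are made up).

{code}
import static org.apache.spark.sql.functions.col;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ClosureVsExpression {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local").appName("Closure vs Expression").getOrCreate();
        Dataset<Row> df = spark.range(5).selectExpr("id", "cast(id as string) as name");

        // Opaque closure: Catalyst only sees a black-box lambda over whole rows.
        Dataset<String> viaClosure =
                df.map((MapFunction<Row, String>) r -> r.getAs("name"), Encoders.STRING());

        // Structured expression: Catalyst sees col("name") and can prune and push down.
        Dataset<Row> viaExpression = df.select(col("name"));

        viaClosure.show();
        viaExpression.show();
        spark.stop();
    }
}
{code}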



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896343#comment-15896343
 ] 

Sean Owen commented on SPARK-19656:
---

It accepts it because you tell it that's what the InputFormat will return, but 
it doesn't. The Class arg is there just for its compile-time type. That doesn't 
make it so, and Spark has no way of verifying that it is what your InputFormat 
actually returns.

newAPIHadoopFile doesn't load the data as anything in particular; the InputFormat does. 
You are still really talking about Hadoop and Avro APIs.

I'm going to leave the conversation there and close this, as this is as much as 
is reasonable to consider in the context of Spark. This is not a bug as-is. You 
can take this info to explore how to work with Avro values elsewhere. A JIRA 
can be reopened if you have a clear and reproducible problem in what Spark is 
supposed to return or do and what it does. That does require understanding the 
operation of Hadoop APIs. Questions should stay on the mailing list or SO, if 
it's still in the realm of "how can I get this to work?"
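
For completeness, a hedged sketch of the conversion described above: accept that the 
reader yields GenericRecord and map it onto the domain type explicitly. MyCustomClass, 
setSomeCustomField and the field name are the hypothetical names from the report; 
AvroKeyInputFormat is Avro's stock mapreduce input format.

{code}
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GenericRecordToCustomClass {

    @SuppressWarnings("unchecked")
    public static JavaRDD<MyCustomClass> load(JavaSparkContext sc, String path) {
        // The reader hands back generic records; the casts only line up compile-time generics.
        JavaPairRDD<AvroKey<GenericRecord>, NullWritable> raw = sc.newAPIHadoopFile(path,
                (Class<AvroKeyInputFormat<GenericRecord>>) (Class<?>) AvroKeyInputFormat.class,
                (Class<AvroKey<GenericRecord>>) (Class<?>) AvroKey.class,
                NullWritable.class,
                sc.hadoopConfiguration());

        // Convert each GenericRecord into the domain class by reading its fields explicitly.
        return raw.map(pair -> {
            GenericRecord rec = pair._1().datum();
            MyCustomClass out = new MyCustomClass();
            out.setSomeCustomField(rec.get("someCustomField").toString());
            return out;
        });
    }
}
{code}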

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-19656.
-

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896314#comment-15896314
 ] 

Nira Amit commented on SPARK-19656:
---

And by the way, "what is in the file" is bytes. The question is what I load 
these bytes into. I'm trying to load them into a MyCustomClass; apparently what 
newAPIHadoopFile is loading them into is a GenericData$Record, even though the 
return type it promises is JavaPairRDD.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19714:


Assignee: Apache Spark

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>Assignee: Apache Spark
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?
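
For context (this is not the resolution of the ticket): in 2.1 handleInvalid is 
documented to govern invalid entries such as NaN, while out-of-range values are 
rejected unless the splits themselves cover them. Below is a hedged Java sketch of 
that workaround, widening the splits with +/- infinity; the output column name is 
made up.

{code}
import org.apache.spark.ml.feature.Bucketizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BucketizerSplitsDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local").appName("Bucketizer Test").getOrCreate();

        Dataset<Row> contDF = spark.range(500).selectExpr("cast(id as double) as id");

        // Widening the splits is what lets values below 5.0 through;
        // handleInvalid("skip") only governs NaN / invalid entries.
        double[] splits = {Double.NEGATIVE_INFINITY, 5.0, 10.0, 250.0, 500.0, Double.POSITIVE_INFINITY};

        Bucketizer bucketizer = new Bucketizer()
                .setSplits(splits)
                .setInputCol("id")
                .setOutputCol("bucket")
                .setHandleInvalid("skip");

        bucketizer.transform(contDF).show();
        spark.stop();
    }
}
{code}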



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-03-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896290#comment-15896290
 ] 

Apache Spark commented on SPARK-19714:
--

User 'wojtek-szymanski' has created a pull request for this issue:
https://github.com/apache/spark/pull/17169

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-03-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19714:


Assignee: (was: Apache Spark)

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896286#comment-15896286
 ] 

Nira Amit edited comment on SPARK-19656 at 3/5/17 2:55 PM:
---

But then why does the compiler accept what newAPIHadoopFile returns as 
MyCustomClass? If what you are saying is correct, than the only return type 
that should be acceptable is a GenericData$Record or something that can be 
casted to it.


was (Author: amitnira):
But then why does the compiler accepts what newAPIHadoopFile returns as 
MyCustomClass? If what you are saying is correct, than the only return type 
that should be acceptable is a GenericData$Record or something that can be 
casted to it.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896286#comment-15896286
 ] 

Nira Amit edited comment on SPARK-19656 at 3/5/17 2:56 PM:
---

But then why does the compiler accept what newAPIHadoopFile returns as 
MyCustomClass? If what you are saying is correct, then the only return type 
that should be accepted is a GenericData$Record or something that can be casted 
to it.


was (Author: amitnira):
But then why does the compiler accept what newAPIHadoopFile returns as 
MyCustomClass? If what you are saying is correct, than the only return type 
that should be acceptable is a GenericData$Record or something that can be 
casted to it.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896286#comment-15896286
 ] 

Nira Amit commented on SPARK-19656:
---

But then why does the compiler accepts what newAPIHadoopFile returns as 
MyCustomClass? If what you are saying is correct, than the only return type 
that should be acceptable is a GenericData$Record or something that can be 
casted to it.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896282#comment-15896282
 ] 

Sean Owen commented on SPARK-19656:
---

Well, at the least, I'd suggest posting an example that compiles. But the last 
point is, I think, your problem: you are correctly getting a GenericData$Record 
because that is what is in the file. You need to call methods on that object to 
get your typed object out. That's an Avro usage issue in your code. You need to 
investigate that before opening a JIRA.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896277#comment-15896277
 ] 

Nira Amit commented on SPARK-19656:
---

The only reason my code sample doesn't compile is that it doesn't include my 
actual custom class implementation. Otherwise it's a copy-paste of my valid 
code, which compiles, runs, and then crashes with a RuntimeException. And that 
is because the class it gets at runtime isn't what the compiler was told to 
expect. I understand that the solution is to migrate my code to Datasets. But 
this seems like a problem in the newAPIHadoopFile API.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896272#comment-15896272
 ] 

Sean Owen commented on SPARK-19656:
---

PS I should be concrete about why I think the original code doesn't work -- it 
doesn't compile because you're using newAPIHadoopFile whereas the example you 
follow uses hadoopFile. If you adjusted that, then I think you're getting back 
an Avro GenericRecord as expected. Avro has its own records in a file, not your 
objects. You need to get() your type out of it?

But that's an issue in your code. I think the reason this went to DataFrame / 
Dataset is that there is first-class support for Avro there where your types 
get unpacked. That's the better way to do this anyway, although there shouldn't 
be much reason you can't do this with RDDs if you must.
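
A short sketch of that DataFrame/Dataset route, assuming the spark-avro data source 
is on the classpath and MyCustomClass is a plain Java bean whose properties match 
the Avro schema:

{code}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AvroAsDataset {
    public static Dataset<MyCustomClass> load(SparkSession spark, String path) {
        Dataset<Row> df = spark.read()
                .format("com.databricks.spark.avro")  // the spark-avro package used elsewhere in this thread
                .load(path);
        // The bean encoder maps columns onto the getters/setters of MyCustomClass.
        return df.as(Encoders.bean(MyCustomClass.class));
    }
}
{code}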

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896268#comment-15896268
 ] 

Sean Owen commented on SPARK-19656:
---

Yes, I just tried to compile your code example above, and it doesn't work, but 
for more basic reasons. That much is "Not A Problem" because you've got more 
basic usage errors. That is, this is not an example of code that should work 
but doesn't due to Avro issues. As to the narrower question of whether 
Datasets and casting work, they do, and I verified that it compiles.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896266#comment-15896266
 ] 

Nira Amit commented on SPARK-19656:
---

[~sowen] I have been trying this for weeks, every way I could possibly think of. 
Have you (really) tried any of my code samples? With RDDs, not Datasets? If 
this is not possible with newAPIHadoopFile, then it's not an issue for the 
mailing lists but rather something that should be mentioned explicitly in the 
documentation, because apparently I'm not the only one who expected this to 
work and couldn't figure out what they were doing wrong.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896262#comment-15896262
 ] 

Nira Amit commented on SPARK-19656:
---

Thanks Eric, but my question is about RDDs. Is it correct that it is not 
possible to load custom classes directly into RDDs in Java, only into DataFrames?

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16102) Use Record API from Univocity rather than current data cast API.

2017-03-05 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896247#comment-15896247
 ] 

Hyukjin Kwon commented on SPARK-16102:
--

This is taking longer than I thought. Let me update this soon.

> Use Record API from Univocity rather than current data cast API.
> 
>
> Key: SPARK-16102
> URL: https://issues.apache.org/jira/browse/SPARK-16102
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> There is a Record API for the Univocity parser.
> This API provides typed data. Spark currently tries to compare and cast each 
> value itself.
> Using this library should reduce the code in Spark and maybe improve 
> performance. 
> It seems a benchmark should be run first.
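
For reference, a minimal sketch of the univocity Record API in question (assuming 
univocity-parsers 2.x; the file and column names are made up):

{code}
import java.io.File;

import com.univocity.parsers.common.record.Record;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class RecordApiDemo {
    public static void main(String[] args) {
        CsvParserSettings settings = new CsvParserSettings();
        settings.setHeaderExtractionEnabled(true);   // use the first row as column names

        CsvParser parser = new CsvParser(settings);
        for (Record record : parser.parseAllRecords(new File("people.csv"))) {
            String name = record.getString("name"); // typed accessors instead of raw String[]
            Integer age = record.getInt("age");
            System.out.println(name + " is " + age);
        }
    }
}
{code}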



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19254) Support Seq, Map, and Struct in functions.lit

2017-03-05 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19254.
---
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.2.0

> Support Seq, Map, and Struct in functions.lit
> -
>
> Key: SPARK-19254
> URL: https://issues.apache.org/jira/browse/SPARK-19254
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.2.0
>
>
> In the current implementation, functions.lit does not support Seq, Map, and 
> Struct. This ticket is intended to add support for them. It is a follow-up of 
> https://issues.apache.org/jira/browse/SPARK-17683.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896194#comment-15896194
 ] 

Eric Maynard edited comment on SPARK-19656 at 3/5/17 11:51 AM:
---

Here is a complete working example in Java:

{code:title=AvroTest.java|borderStyle=solid}
public class AvroTest {

public static void main(String[] args){

//build spark session:
System.setProperty("hadoop.home.dir", "C:\\Hadoop");//windows hack
SparkSession spark = 
SparkSession.builder().master("local").appName("Avro Test")
.config("spark.sql.warehouse.dir", 
"file:///c:/tmp/spark-warehouse")//another windows hack
.getOrCreate();

//create data:
ArrayList<CustomClass> list = new ArrayList<>();
CustomClass cc = new CustomClass();
cc.setA(5);
cc.setB(6);
list.add(cc);
spark.createDataFrame(list, 
CustomClass.class).write().mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("C:\\tmp\\file.avro");

//read data:
Row row = 
(spark.read().format("com.databricks.spark.avro").load("C:\\tmp\\file.avro").head());
System.out.println(row);
System.out.println(row.get(0));
System.out.println(row.get(1));
System.out.println("Success =\t" + ((Integer)row.get(0) == 5));
}
}
{code}

With a simple custom class:
{code:title=CustomClass.java|borderStyle=solid}
import java.io.Serializable;

public class CustomClass implements Serializable {
private int a;
public void setA(int value){this.a = value;}
public int getA(){return this.a;}

private int b;
public void setB(int value) {this.b = value;}
public int getB(){return this.b;}
}
{code}  
  
Everything looks ok to me, and after running stdout looks like this:
{code}
[5,6]
5
6
Success =   true
{code}
  
In the future please make sure that you don't have an issue in your application 
before opening a JIRA. Also, as an aside, I really recommend picking up some 
Scala as IMO the Scala API is much friendlier, esp. around the edges for things 
like the avro library.


was (Author: emaynard):
Here is a complete working example in Java:

{code:title=AvroTest.java|borderStyle=solid}
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;

public class AvroTest {

public static void main(String[] args){

//build spark session:
System.setProperty("hadoop.home.dir", "C:\\Hadoop");//windows hack
SparkSession spark = 
SparkSession.builder().master("local").appName("Avro Test")
.config("spark.sql.warehouse.dir", 
"file:///c:/tmp/spark-warehouse")//another windows hack
.getOrCreate();

//create data:
ArrayList list = new ArrayList();
CustomClass cc = new CustomClass();
cc.setValue(5);
list.add(cc);
spark.createDataFrame(list, 
CustomClass.class).write().format("com.databricks.spark.avro").save("C:\\tmp\\file.avro");

//read data:
Row row = 
(spark.read().format("com.databricks.spark.avro").load("C:\\tmp\\file.avro").head());
System.out.println("Success =\t" + ((Integer)row.get(0) == 5));
}
}



{code}

With a simple custom class:
{code:title=CustomClass.java|borderStyle=solid}
import java.io.Serializable;

public class CustomClass implements Serializable {
public int value;
public void setValue(int value){this.value = value;}
public int getValue(){return this.value;}
}
{code}  
  
Everything looks ok to me, and the main function prints "Success = true". In 
the future please make sure that you don't have an issue in your application 
before opening a JIRA. Also, as an aside, I really recommend picking up some 
Scala as IMO the Scala API is much friendlier, esp. around the edges for things 
like the avro library.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase

[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896194#comment-15896194
 ] 

Eric Maynard commented on SPARK-19656:
--

Here is a complete working example in Java:

{code:title=AvroTest.java|borderStyle=solid}
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;

public class AvroTest {

public static void main(String[] args){

//build spark session:
System.setProperty("hadoop.home.dir", "C:\\Hadoop"); //windows hack
SparkSession spark = SparkSession.builder().master("local").appName("Avro Test")
    .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse") //another windows hack
    .getOrCreate();

//create data:
ArrayList<CustomClass> list = new ArrayList<>();
CustomClass cc = new CustomClass();
cc.setValue(5);
list.add(cc);
spark.createDataFrame(list, CustomClass.class)
    .write().format("com.databricks.spark.avro").save("C:\\tmp\\file.avro");

//read data:
Row row = spark.read().format("com.databricks.spark.avro").load("C:\\tmp\\file.avro").head();
System.out.println("Success =\t" + ((Integer) row.get(0) == 5));
}
}



{code}

With a simple custom class:
{code:title=CustomClass.java|borderStyle=solid}
import java.io.Serializable;

public class CustomClass implements Serializable {
public int value;
public void setValue(int value){this.value = value;}
public int getValue(){return this.value;}
}
{code}  
  
Everything looks ok to me, and the main function prints "Success = true". In 
the future please make sure that you don't have an issue in your application 
before opening a JIRA. Also, as an aside, I really recommend picking up some 
Scala as IMO the Scala API is much friendlier, esp. around the edges for things 
like the avro library.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19824) Standalone master JSON not showing cores for running applications

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896193#comment-15896193
 ] 

Sean Owen commented on SPARK-19824:
---

I guess it doesn't show "memory per executor" either? That came up yesterday.
I don't know the standalone master well but it does look like this is in the 
UI. I'm not sure it has to be in the JSON but seems reasonable to be consistent.

> Standalone master JSON not showing cores for running applications
> -
>
> Key: SPARK-19824
> URL: https://issues.apache.org/jira/browse/SPARK-19824
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: Dan
>Priority: Minor
>
> The JSON API of the standalone master ("/json") does not show the number of 
> cores for a running application, which is available on the UI.
>   "activeapps" : [ {
> "starttime" : 1488702337788,
> "id" : "app-20170305102537-19717",
> "name" : "POPAI_Aggregated",
> "user" : "ibiuser",
> "memoryperslave" : 16384,
> "submitdate" : "Sun Mar 05 10:25:37 IST 2017",
> "state" : "RUNNING",
> "duration" : 1141934
>   } ],



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19805) Log the row type when query result does not match

2017-03-05 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19805.
---
   Resolution: Fixed
 Assignee: Genmao Yu
Fix Version/s: 2.2.0

> Log the row type when query result does not match
> -
>
> Key: SPARK-19805
> URL: https://issues.apache.org/jira/browse/SPARK-19805
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Genmao Yu
>Priority: Minor
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19823) Support Gang Distribution of Task

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896192#comment-15896192
 ] 

Sean Owen commented on SPARK-19823:
---

I don't think that's quite right. Task assignment takes locality into account, 
for example. Overriding the scheduler's locality preferences in order to 
batch-assign tasks to executors that are suboptimal for locality has drawbacks. 
In general, executors that are local to the data a job needs will receive more 
tasks, while those that aren't will be less busy and will therefore become idle 
and terminate sooner. The effect you want already happens indirectly.

> Support Gang Distribution of Task 
> --
>
> Key: SPARK-19823
> URL: https://issues.apache.org/jira/browse/SPARK-19823
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.2
>Reporter: DjvuLee
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896190#comment-15896190
 ] 

Sean Owen commented on SPARK-6407:
--

I did some work on this, but it's not a paper or anything, just some code. The 
relevant bits, which try to compute new user/item updates on the fly, are here:

https://github.com/OryxProject/oryx/blob/master/app/oryx-app/src/main/java/com/cloudera/oryx/app/speed/als/ALSSpeedModelManager.java#L198
https://github.com/OryxProject/oryx/blob/master/app/oryx-app-common/src/main/java/com/cloudera/oryx/app/als/ALSUtils.java

The choices about the semantics of the updates are in ALSUtils. If you dig into 
it, we can discuss offline and I can probably write more in the docs to make it 
clearer what's happening.
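To spell out the underlying idea (this is my shorthand for the general fold-in approach, not 
necessarily the exact semantics ALSUtils implements): hold the item factor matrix Y fixed and 
solve a small regularized least-squares problem for the new or updated user vector, something like

{code}
x_u = (Y_u^T Y_u + \lambda I)^{-1} Y_u^T r_u
{code}

where Y_u contains the factor rows for the items the user has interacted with and r_u is that 
user's feedback vector. Each update then only touches a k x k system instead of re-running a 
full distributed ALS job.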


> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896186#comment-15896186
 ] 

Sean Owen commented on SPARK-19656:
---

I guess I mean, have you really tried it? It doesn't result in a compile error, 
and you didn't say what the compile error is. This works, yes:

{code}
public static class Foo {}

...
Dataset<Row> ds = spark.read()...;
ds.map((MapFunction<Row, Foo>) row -> (Foo) row.get(0), new MyFooEncoder());
...
{code}

Meaning, the cast in question works and you can map to a new Dataset if you 
have an encoder for your type.
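To make that concrete, here is a slightly more filled-in version of the sketch above. The 
encoder and the input path are hypothetical choices of mine (Encoders.kryo is just one easy 
way to get an Encoder for an arbitrary class), and it assumes an existing SparkSession named 
spark and the Foo class declared above:

{code}
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

//Encoders.kryo is a hypothetical choice here; any Encoder<Foo> would do
Encoder<Foo> fooEncoder = Encoders.kryo(Foo.class);

//hypothetical input path; any DataFrame works the same way
Dataset<Row> ds = spark.read().format("com.databricks.spark.avro").load("/path/to/file.avro");

//the cast compiles fine; at runtime it only succeeds if column 0 actually holds a Foo
Dataset<Foo> foos = ds.map((MapFunction<Row, Foo>) row -> (Foo) row.get(0), fooEncoder);
{code}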

The rest of the example you provide above doesn't work; it looks like a Hadoop 
API version problem. That's up to your code though. You're trying to use old 
Hadoop API Avro classes with newAPIHadoopFile.

This should be on the mailing list until it's narrowed down.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896179#comment-15896179
 ] 

Nira Amit commented on SPARK-19656:
---

Yes, I did, and I answered him that it gives a compilation error in Java. Have 
you tried doing this in Java? If it is possible, then please give a working 
code example; there should be no need for discussion if there is a correct way 
of doing this.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-05 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896177#comment-15896177
 ] 

Daniel Li edited comment on SPARK-6407 at 3/5/17 10:57 AM:
---

{quote}
In practice fold-in works fine. Folding in a day or so of updates has been OK.
The question isn't RMSE but how it affects actual rankings of items in 
recommendations, and it takes a while before the effect of the approximation 
actually changes a rank.
{quote}

Hmm, I see.  This would be something I'd be interested in implementing for 
Spark if there's need.  Are there implementations (or papers) of this you know 
of that I could look at?


was (Author: danielyli):
bq. In practice fold-in works fine. Folding in a day or so of updates has been 
OK.
The question isn't RMSE but how it affects actual rankings of items in 
recommendations, and it takes a while before the effect of the approximation 
actually changes a rank.

Hmm, I see.  This would be something I'd be interested in implementing for 
Spark if there's need.  Are there implementations (or papers) of this you know 
of that I could look at?

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-05 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896177#comment-15896177
 ] 

Daniel Li commented on SPARK-6407:
--

bq. In practice fold-in works fine. Folding in a day or so of updates has been 
OK.
The question isn't RMSE but how it affects actual rankings of items in 
recommendations, and it takes a while before the effect of the approximation 
actually changes a rank.

Hmm, I see.  This would be something I'd be interested in implementing for 
Spark if there's need.  Are there implementations (or papers) of this you know 
of that I could look at?

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896174#comment-15896174
 ] 

Sean Owen commented on SPARK-19656:
---

Have you tried Eric's suggestion? asInstanceOf is just casting in Java. That is 
the kind of discussion to have on the mailing list and homework to do before 
opening a JIRA.
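For reference, since this keeps coming up: Scala's asInstanceOf corresponds to a plain Java 
cast. This is illustration only; "record" here stands in for the AvroKey from the description 
above:

{code}
//Scala:  val c = record.datum().asInstanceOf[MyCustomClass]
//Java equivalent, a plain cast:
MyCustomClass c = (MyCustomClass) record.datum();
//both compile; both throw ClassCastException at runtime if the datum is really
//an org.apache.avro.generic.GenericData$Record rather than a MyCustomClass
{code}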

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896172#comment-15896172
 ] 

Nira Amit commented on SPARK-19656:
---

But if this is not possible to do in Java, then it IS an actionable change, 
isn't it? I already posted this question several weeks ago on StackOverflow and 
got many upvotes but no answer, which is why I posted it in the "Question" 
category of your Jira. Is it possible in Java or isn't it? From Eric's answer 
it sounds like it should be, yet nobody seems to know how to do it.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896169#comment-15896169
 ] 

Sean Owen commented on SPARK-19656:
---

Mostly, it is that questions should go to the mailing list. I don't keep track 
of your JIRAs. This should be reserved for actionable changes, not questions.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896166#comment-15896166
 ] 

Nira Amit commented on SPARK-19656:
---

[~emaynard] There is no "asInstanceOf" method in the Java API. And if I try to 
cast it directly I get a compilation error.
[~sowen] Are you not handling tickets about the Java API? This is the second 
time you have closed a ticket I opened about loading custom objects from Avro 
in Java and marked it as "Not a problem". Either this is not possible in Java, 
in which case it's at least a missing feature (and misleading, because it looks 
like it should be possible), or I'm not doing it right, in which case you can 
provide a working code example in Java.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19656.
---
Resolution: Not A Problem

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19792) In the Master Page,the column named “Memory per Node” ,I think it is not all right

2017-03-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19792:
-

Assignee: liuxian

> In the Master Page,the column named “Memory per Node” ,I think  it is not all 
> right
> ---
>
> Key: SPARK-19792
> URL: https://issues.apache.org/jira/browse/SPARK-19792
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Trivial
> Fix For: 2.2.0
>
>
> On the Spark web UI, the Master page has two tables: Running Applications 
> and Completed Applications. The column named “Memory per Node” is misleading, 
> because a node may have more than one executor. It should be renamed to 
> “Memory per Executor” to avoid confusing users.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19792) In the Master Page,the column named “Memory per Node” ,I think it is not all right

2017-03-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19792.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17132
[https://github.com/apache/spark/pull/17132]

> In the Master Page,the column named “Memory per Node” ,I think  it is not all 
> right
> ---
>
> Key: SPARK-19792
> URL: https://issues.apache.org/jira/browse/SPARK-19792
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: liuxian
>Priority: Trivial
> Fix For: 2.2.0
>
>
> On the Spark web UI, the Master page has two tables: Running Applications 
> and Completed Applications. The column named “Memory per Node” is misleading, 
> because a node may have more than one executor. It should be renamed to 
> “Memory per Executor” to avoid confusing users.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19713) saveAsTable

2017-03-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19713.
---
Resolution: Not A Problem

> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed an issue with dataframe.saveAsTable("table") (in old versions) 
> and dataframe.write.saveAsTable("table") (in newer versions).
> When calling “df3.saveAsTable("brokentable")” from Scala code, Spark creates a 
> folder in HDFS but does not tell the Hive metastore that it intends to create 
> the table. So if anything goes wrong in between, the folder still exists while 
> Hive is unaware of it. This then blocks users from creating the table 
> “brokentable”, because the folder already exists; it can be removed with 
> “hadoop fs -rmr /data/hive/databases/testdb.db/brokentable”. Below is a 
> workaround that lets you continue development work.
> Current code:
> val df3 = sqlContext.sql("select * from testtable")
> df3.saveAsTable("brokentable")
> THE WORKAROUND:
> Registering the DataFrame as a temporary table and then loading the data with 
> a SQL command resolves the issue. For example:
> sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896154#comment-15896154
 ] 

Sean Owen commented on SPARK-6407:
--

Computing one or two iterations per update -- as in every time someone clicks 
on a product or something? No, that's way, way too slow. Each update would 
launch tens of large distributed jobs.

In practice fold-in works fine. Folding in a day or so of updates has been OK.
The question isn't RMSE but how it affects actual rankings of items in 
recommendations, and it takes a while before the effect of the approximation 
actually changes a rank. 

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an undefined 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned  temporary view consisting of a large 
> number of files stored on S3, then it may take the analyzer a long time 
> before realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, 

[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an undefined 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an undefined 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned  temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 
