[jira] [Assigned] (SPARK-19145) Timestamp to String casting is slowing the query significantly
[ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19145:
------------------------------------

    Assignee: (was: Apache Spark)

> Timestamp to String casting is slowing the query significantly
> --------------------------------------------------------------
>
>                 Key: SPARK-19145
>                 URL: https://issues.apache.org/jira/browse/SPARK-19145
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: gagan taneja
>
> I have a time-series table with a timestamp column. The following query
>
>   SELECT COUNT(*) AS `count`
>   FROM `default`.`table`
>   WHERE `time` >= '2017-01-02 19:53:51'
>     AND `time` <= '2017-01-09 19:53:51' LIMIT 5
>
> is significantly SLOWER than
>
>   SELECT COUNT(*) AS `count`
>   FROM `default`.`table`
>   WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD HH24:MI:SS-0800')
>     AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD HH24:MI:SS-0800') LIMIT 5
>
> After investigation I found that in the first query the time column is cast to
> String before the filter is applied, whereas in the second query no such cast
> is performed and the filter compares long values.
> Below is the generated physical plan for the slower execution, followed by the
> physical plan for the faster execution.
>
>   SELECT COUNT(*) AS `count`
>   FROM `default`.`table`
>   WHERE `time` >= '2017-01-02 19:53:51'
>     AND `time` <= '2017-01-09 19:53:51' LIMIT 5
>
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
>                +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct
>
>   SELECT COUNT(*) AS `count`
>   FROM `default`.`table`
>   WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD HH24:MI:SS-0800')
>     AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD HH24:MI:SS-0800') LIMIT 5
>
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3262) && (time#3262 >= 148340483100)) && (time#3262 <= 148400963100))
>                +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct
>
> In Impala both queries run efficiently, with no performance difference.
> Spark should be able to parse the date string and convert it to a
> long/timestamp while generating the optimized logical plan, so that both
> queries have similar performance.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
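The core of the report is that a timestamp literal can be folded once into the column's internal long representation, so the per-row filter becomes a long comparison instead of formatting every row's timestamp as a string. Below is a minimal, self-contained sketch of that idea (it is not Spark's actual Catalyst rule; the class and method names are made up for illustration, and UTC is assumed for simplicity):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimestampFoldDemo {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Parse the literal ONCE, up front, into epoch microseconds (the internal
    // representation Spark uses for timestamps). The per-row filter then only
    // compares two longs.
    static long literalToEpochMicros(String literal) {
        LocalDateTime ts = LocalDateTime.parse(literal, FMT);
        return ts.toEpochSecond(ZoneOffset.UTC) * 1_000_000L;
    }

    public static void main(String[] args) {
        long lo = literalToEpochMicros("2017-01-02 19:53:51");
        long hi = literalToEpochMicros("2017-01-09 19:53:51");

        // A row value inside the range, already stored as epoch micros.
        long row = literalToEpochMicros("2017-01-05 00:00:00");

        // Fast path: two long comparisons per row, no allocation.
        boolean keep = row >= lo && row <= hi;
        System.out.println(keep); // prints "true"

        // Slow path (what the unoptimized plan does): render each row's
        // timestamp back into a string, then compare strings lexicographically.
        String rowStr = LocalDateTime.ofEpochSecond(row / 1_000_000L, 0, ZoneOffset.UTC).format(FMT);
        boolean keepSlow = rowStr.compareTo("2017-01-02 19:53:51") >= 0
                        && rowStr.compareTo("2017-01-09 19:53:51") <= 0;
        System.out.println(keepSlow == keep); // prints "true"
    }
}
```

Both paths agree because the `yyyy-MM-dd HH:mm:ss` rendering compares lexicographically in the same order as the timestamps themselves; the difference is purely the per-row cost of formatting and string comparison, which is what the slower plan pays on every row.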
[jira] [Assigned] (SPARK-19145) Timestamp to String casting is slowing the query significantly
[ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19145:
------------------------------------

    Assignee: Apache Spark

> Timestamp to String casting is slowing the query significantly
> (full issue description quoted above)
[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly
[ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896842#comment-15896842 ]

Apache Spark commented on SPARK-19145:
--------------------------------------

User 'tanejagagan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17174

> Timestamp to String casting is slowing the query significantly
> (full issue description quoted above)
[jira] [Assigned] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name
[ https://issues.apache.org/jira/browse/SPARK-19832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19832:
------------------------------------

    Assignee: (was: Apache Spark)

> DynamicPartitionWriteTask should escape the partition name
> ----------------------------------------------------------
>
>                 Key: SPARK-19832
>                 URL: https://issues.apache.org/jira/browse/SPARK-19832
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Song Jun
>
> Currently, when DynamicPartitionWriteTask builds the partitionPath of a
> partition, it escapes only the partition value, not the partition name.
> This causes problems when the partition name contains special characters,
> for example:
> 1) If the partition name contains '%' etc., two partition paths are created
> in the filesystem: an escaped path like '/path/a%25b=1' and an unescaped
> path like '/path/a%b=1'. The inserted data is stored under the unescaped
> path, while SHOW PARTITIONS returns the escaped name 'a%25b=1', so the two
> are inconsistent. The data should be stored under the escaped path in the
> filesystem, which is also what Hive 2.0.0 does.
> 2) If the partition name contains ':', new Path("/path","a:b") throws an
> exception, because a colon in a relative path is illegal:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: a:b
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:171)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:88)
>   ... 48 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 50 more
> {code}
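The fix the ticket asks for is to apply the same percent-escaping to the partition name that is already applied to the partition value before joining them into a path segment. Below is a minimal sketch of that idea; it is not Hive's or Spark's actual escaping code, and the set of unsafe characters here is illustrative, not exhaustive:

```java
// Sketch of Hive-style partition-path escaping, applied to BOTH the
// partition name and the partition value -- the bug was that only the
// value was escaped.
public class PartitionPathDemo {
    // Illustrative subset of characters that are unsafe in a path segment.
    private static final String UNSAFE = "%:/#\\";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (UNSAFE.indexOf(c) >= 0) {
                // Percent-encode as %XX, e.g. '%' -> %25, ':' -> %3A.
                sb.append(String.format("%%%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    static String partitionSegment(String name, String value) {
        return escape(name) + "=" + escape(value);
    }

    public static void main(String[] args) {
        System.out.println(partitionSegment("a%b", "1")); // prints "a%25b=1"
        System.out.println(partitionSegment("a:b", "1")); // prints "a%3Ab=1"
    }
}
```

With the name escaped, 'a%b=1' always lands at '/path/a%25b=1' (matching SHOW PARTITIONS), and 'a:b' never reaches new Path() with a raw colon, avoiding the URISyntaxException above.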
[jira] [Commented] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name
[ https://issues.apache.org/jira/browse/SPARK-19832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896813#comment-15896813 ]

Apache Spark commented on SPARK-19832:
--------------------------------------

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/17173

> DynamicPartitionWriteTask should escape the partition name
> (full issue description quoted above)
[jira] [Assigned] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name
[ https://issues.apache.org/jira/browse/SPARK-19832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19832:
------------------------------------

    Assignee: Apache Spark

> DynamicPartitionWriteTask should escape the partition name
> (full issue description quoted above)
[jira] [Created] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name
Song Jun created SPARK-19832:
--------------------------------

             Summary: DynamicPartitionWriteTask should escape the partition name
                 Key: SPARK-19832
                 URL: https://issues.apache.org/jira/browse/SPARK-19832
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Song Jun

(full issue description quoted above)
[jira] [Commented] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program
[ https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896808#comment-15896808 ]

Apache Spark commented on SPARK-19008:
--------------------------------------

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17172

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from
> Dataset program
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-19008
>                 URL: https://issues.apache.org/jira/browse/SPARK-19008
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Kazuaki Ishizaki
>
> In a [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919]
> between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid
> boxing/unboxing overhead when a Dataset program calls a lambda, written in
> Scala, that operates on a primitive type. In such a case, Catalyst can
> directly call a method {{ apply();}} instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be
> [here|https://issues.apache.org/jira/browse/SPARK-14083].
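The boxing overhead the ticket describes can be illustrated at the plain-JVM level, independent of Catalyst: calling a lambda through a generic functional interface goes through the erased `Object apply(Object)` entry point and boxes the primitive on every call, while a primitive-specialized interface keeps the value as a `long` throughout. This is only an analogy for what the ticket proposes inside Catalyst codegen, not Spark code:

```java
import java.util.function.Function;
import java.util.function.LongUnaryOperator;

public class BoxingDemo {
    // Generic functional interface: the erased entry point is
    // Object apply(Object), so each call boxes the argument and the result.
    static final Function<Long, Long> BOXED = x -> x + 1;

    // Primitive specialization: long applyAsLong(long) -- no boxing at all.
    static final LongUnaryOperator PRIMITIVE = x -> x + 1;

    static long addOneBoxed(long x)     { return BOXED.apply(x); }
    static long addOnePrimitive(long x) { return PRIMITIVE.applyAsLong(x); }

    public static void main(String[] args) {
        System.out.println(addOneBoxed(41L));     // prints "42"
        System.out.println(addOnePrimitive(41L)); // prints "42"
    }
}
```

In a tight per-row loop, the boxed path allocates a `Long` for the argument and another for the result on every call (outside the small cached range), which is exactly the per-element overhead the ticket wants Catalyst to avoid by calling the primitive-typed `apply` directly.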
[jira] [Assigned] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program
[ https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19008:
------------------------------------

    Assignee: (was: Apache Spark)

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from
> Dataset program
> (full issue description quoted above)
[jira] [Assigned] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program
[ https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19008:
------------------------------------

    Assignee: Apache Spark

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from
> Dataset program
> (full issue description quoted above)
[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hustfxj updated SPARK-19831:
----------------------------
    Description:
Cleaning the application may cost much time at worker, then it will block that the worker send heartbeats master because the worker is extend *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the message *ApplicationFinished*, master will think the worker is dead. If the worker has a driver, the driver will be scheduled by master again. So I think it is the bug on spark. It may solve this problem by the followed suggests:
1. It had better put the cleaning the application in a single asynchronous thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages like *SendHeartbeat*;
2. It had better not send the heartbeat master by Rpc channel. Because any other rpc message may block the rpc channel. It had better send the heartbeat master at an asynchronous timing thread.

    was:
Cleaning the application may cost much time at worker, then it will block that the worker send heartbeats master because the worker is extend *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the message *ApplicationFinished*, master will think the worker is dead. If the worker has a driver, the driver will be scheduled by master again. So I think it is the bug on spark. It may solve this problem by the followed suggests:
1. It had better put the cleaning the application in a single asynchronous thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages like SendHeartbeat;
2. It had better not send the heartbeat master by rpc channel. Because any other rpc message may block the rpc channel. It had better send the heartbeat master at an asynchronous timing thread.
> Sending the heartbeat master from worker maybe blocked by other rpc messages
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-19831
>                 URL: https://issues.apache.org/jira/browse/SPARK-19831
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: hustfxj
>
> Cleaning up an application may take a long time on the worker, and it can
> block the worker from sending heartbeats to the master, because the worker
> extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by the
> *ApplicationFinished* message, the master will consider the worker dead, and
> if the worker hosts a driver, the master will schedule that driver again. I
> think this is a bug in Spark. It could be solved in the following ways:
> 1. Run application cleanup in a separate asynchronous thread such as
> 'cleanupThreadExecutor', so it does not block other rpc messages like
> *SendHeartbeat*;
> 2. Do not send the heartbeat to the master over the rpc channel, since any
> other rpc message may block that channel; instead, send the heartbeat from
> an asynchronous timer thread.
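The two suggestions above can be sketched together at the plain-JVM level: heartbeats fire from a dedicated scheduled thread, while the slow cleanup runs on its own executor, so neither shares a thread with the other. This is a minimal illustration of the threading idea only, not Spark's Worker code; `runDemo` and its parameters are made up for the demo:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class HeartbeatDemo {
    // Run a heartbeat timer for `totalMillis` while a slow cleanup task of
    // `cleanupMillis` runs on a separate executor; return how many heartbeats
    // fired. Because the two run on different threads, the cleanup cannot
    // delay the heartbeats.
    static int runDemo(long cleanupMillis, long totalMillis) throws InterruptedException {
        AtomicInteger heartbeats = new AtomicInteger();

        // Suggestion 2: a dedicated single-thread scheduler used only for
        // heartbeats, decoupled from the rpc message loop.
        ScheduledExecutorService heartbeatTimer = Executors.newSingleThreadScheduledExecutor();
        heartbeatTimer.scheduleAtFixedRate(
                heartbeats::incrementAndGet,   // stand-in for sending SendHeartbeat
                0, 50, TimeUnit.MILLISECONDS);

        // Suggestion 1: slow ApplicationFinished-style cleanup on its own
        // executor (a 'cleanupThreadExecutor'), not on the message loop.
        ScheduledExecutorService cleanupThreadExecutor = Executors.newSingleThreadScheduledExecutor();
        cleanupThreadExecutor.submit(() -> {
            try { Thread.sleep(cleanupMillis); } catch (InterruptedException ignored) { }
        });

        Thread.sleep(totalMillis);
        heartbeatTimer.shutdownNow();
        cleanupThreadExecutor.shutdownNow();
        return heartbeats.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Heartbeats keep flowing while the 300 ms cleanup is in progress.
        System.out.println(runDemo(300, 400) >= 2);
    }
}
```

If instead the cleanup ran inline on the same thread as the heartbeat sender (as in the reported bug), no heartbeat could fire until the cleanup finished, and a sufficiently long cleanup would exceed the master's heartbeat timeout.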
[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hustfxj updated SPARK-19831:
----------------------------
    Description:
Cleaning the application may cost much time at worker, then it will block that the worker send heartbeats master because the worker is extend *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the message *ApplicationFinished*, master will think the worker is dead. If the worker has a driver, the driver will be scheduled by master again. So I think it is the bug on spark. It may solve this problem by the followed suggests:
1. It had better put the cleaning the application in a single asynchronous thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages like SendHeartbeat;
2. It had better not send the heartbeat master by rpc channel. Because any other rpc message may block the rpc channel. It had better send the heartbeat master at an asynchronous timing thread.

    was:
Cleaning the application may cost much time at worker, then it will block that the worker send heartbeats master because the worker is extend *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the message *ApplicationFinished*, master will think the worker is dead. If the worker has a driver, the driver will be scheduled by master again. So I think it is the bug on spark. I can solve this problem by followed suggests:
1. It had better put the cleaning the application in a single asynchronous thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages like SendHeartbeat;
2. It had better not send the heartbeat master by rpc channel. Because any other rpc message may block the rpc channel. It had better send the heartbeat master at an asynchronous timing thread.
> Sending the heartbeat master from worker maybe blocked by other rpc messages
> (issue description quoted above)
[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hustfxj updated SPARK-19831:
----------------------------
    Description:
Cleaning the application may cost much time at worker, then it will block that the worker send heartbeats master because the worker is extend *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the message *ApplicationFinished*, master will think the worker is dead. If the worker has a driver, the driver will be scheduled by master again. So I think it is the bug on spark. I can solve this problem by followed suggests:
1. It had better put the cleaning the application in a single asynchronous thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages like SendHeartbeat;
2. It had better not send the heartbeat master by rpc channel. Because any other rpc message may block the rpc channel. It had better send the heartbeat master at an asynchronous timing thread.

    was:
Cleaning the application may cost much time at worker, then it will block that the worker send heartbeats master and rpc messages because the worker is extend *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the message *ApplicationFinished*, master will think the worker is dead. If the worker has a driver, the driver will be scheduled by master again. So I think it is the bug on spark. I can solve this problem by followed suggests:
1. It had better put the cleaning the application in a single asynchronous thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages like SendHeartbeat;
2. It had better not send the heartbeat master by rpc channel. Because any other rpc message may block the rpc channel. It had better send the heartbeat master at an asynchronous timing thread.
> Sending the heartbeat master from worker maybe blocked by other rpc messages
> (issue description quoted above)
[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hustfxj updated SPARK-19831: Description: Cleaning up an application may take a long time on the worker, which blocks the worker from sending heartbeats and other RPC messages to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. I therefore think this is a bug in Spark. Two suggestions: 1. Run application cleanup in a separate asynchronous thread such as 'cleanupThreadExecutor', so it will not block other RPC messages like SendHeartbeat; 2. Do not send the heartbeat to the master over the RPC channel, because any other RPC message may block that channel; send the heartbeat from a dedicated asynchronous timer thread instead. was: Cleaning up an application may take a long time on the worker, which blocks the worker from sending heartbeats and other RPC messages to the master because the worker extends *ThreadSafeRpcEndpoint*, so the master will consider the worker dead. If the worker hosts a driver, the master will schedule that driver again. I therefore think this is a bug in Spark. Two suggestions: 1. Run application cleanup in a separate asynchronous thread such as 'cleanupThreadExecutor', so it will not block other RPC messages like SendHeartbeat; 2. Do not send the heartbeat to the master over the RPC channel, because any other RPC message may block that channel; send the heartbeat from a dedicated asynchronous timer thread instead.
> Sending the heartbeat master from worker maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: hustfxj
>
> Cleaning up an application may take a long time on the worker, which blocks the worker from sending heartbeats and other RPC messages to the master because the worker extends *ThreadSafeRpcEndpoint*. If a worker's heartbeat is blocked by an *ApplicationFinished* message, the master will consider the worker dead; if the worker hosts a driver, the master will schedule that driver again. I therefore think this is a bug in Spark. Two suggestions:
> 1. Run application cleanup in a separate asynchronous thread such as 'cleanupThreadExecutor', so it will not block other RPC messages like SendHeartbeat;
> 2. Do not send the heartbeat to the master over the RPC channel, because any other RPC message may block that channel; send the heartbeat from a dedicated asynchronous timer thread instead.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hustfxj updated SPARK-19831: Summary: Sending the heartbeat master from worker maybe blocked by other rpc messages (was: Sending the heartbeat to master maybe blocked by other rpc messages)
> Sending the heartbeat master from worker maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: hustfxj
>
> Cleaning up an application may take a long time on the worker, which blocks the worker from sending heartbeats and other RPC messages to the master because the worker extends *ThreadSafeRpcEndpoint*, so the master will consider the worker dead. If the worker hosts a driver, the master will schedule that driver again. I therefore think this is a bug in Spark. Two suggestions:
> 1. Run application cleanup in a separate asynchronous thread such as 'cleanupThreadExecutor', so it will not block other RPC messages like SendHeartbeat;
> 2. Do not send the heartbeat to the master over the RPC channel, because any other RPC message may block that channel; send the heartbeat from a dedicated asynchronous timer thread instead.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19831) Sending the heartbeat to master maybe blocked by other rpc messages
hustfxj created SPARK-19831: --- Summary: Sending the heartbeat to master maybe blocked by other rpc messages Key: SPARK-19831 URL: https://issues.apache.org/jira/browse/SPARK-19831 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: hustfxj
Cleaning up an application may take a long time on the worker, which blocks the worker from sending heartbeats and other RPC messages to the master because the worker extends *ThreadSafeRpcEndpoint*, so the master will consider the worker dead. If the worker hosts a driver, the master will schedule that driver again. I therefore think this is a bug in Spark. Two suggestions:
1. Run application cleanup in a separate asynchronous thread such as 'cleanupThreadExecutor', so it will not block other RPC messages like SendHeartbeat;
2. Do not send the heartbeat to the master over the RPC channel, because any other RPC message may block that channel; send the heartbeat from a dedicated asynchronous timer thread instead.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
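The failure mode described above is generic to any single-threaded message loop: one slow handler starves every later message, including heartbeats. The sketch below is a hypothetical, minimal Python illustration of suggestion 1 (offloading cleanup to a separate executor) — Spark's Worker is Scala, and all names here are invented for the demonstration:

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor

def run_endpoint(offload_cleanup):
    """One-thread message loop, in the spirit of a ThreadSafeRpcEndpoint."""
    inbox = queue.Queue()
    heartbeats = []
    cleanup_pool = ThreadPoolExecutor(max_workers=1)  # plays 'cleanupThreadExecutor'

    def handle(msg):
        if msg == "ApplicationFinished":
            slow_cleanup = lambda: time.sleep(0.5)    # stands in for deleting app dirs
            if offload_cleanup:
                cleanup_pool.submit(slow_cleanup)     # loop stays free for heartbeats
            else:
                slow_cleanup()                        # blocks every later message
        elif msg == "SendHeartbeat":
            heartbeats.append(time.monotonic())

    for m in ["SendHeartbeat", "ApplicationFinished", "SendHeartbeat"]:
        inbox.put(m)
    start = time.monotonic()
    while not inbox.empty():
        handle(inbox.get())
    cleanup_pool.shutdown(wait=True)
    # delay of the heartbeat that was queued behind the cleanup message
    return heartbeats[1] - start
```

With cleanup inline, the second heartbeat waits for the full cleanup (>= 0.5 s here); offloaded, it goes out almost immediately — which is exactly why the master stops declaring the worker dead.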
[jira] [Commented] (SPARK-19578) Poor pyspark performance + incorrect UI input-size metrics
[ https://issues.apache.org/jira/browse/SPARK-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896751#comment-15896751 ] holdenk commented on SPARK-19578: - [~nchammas] That sounds like a pretty good summary from my point of view. Of course it's possible there are some other performance wins we could find (and it might be worth thinking about), but I think our current plan is to focus on improving DataFrame performance for PySpark.
> Poor pyspark performance + incorrect UI input-size metrics
> --
>
> Key: SPARK-19578
> URL: https://issues.apache.org/jira/browse/SPARK-19578
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Web UI
> Affects Versions: 1.6.1, 1.6.2, 2.0.1
> Environment: Spark 1.6.2 Hortonworks
> Spark 2.0.1 MapR
> Spark 1.6.1 MapR
> Reporter: Artur Sukhenko
> Attachments: pyspark_incorrect_inputsize.png, reproduce_log, spark_shell_correct_inputsize.png
>
> Simple job in pyspark takes 14 minutes to complete.
> The text file used to reproduce contains several million lines of the single word "yes" (it might be the cause of poor performance)
> {code}
> var a = sc.textFile("/tmp/yes.txt")
> a.count()
> {code}
> Same code took 33 sec in spark-shell
> Reproduce steps:
> Run this to generate big file (press Ctrl+C after 5-6 seconds)
> [spark@c6401 ~]$ yes > /tmp/yes.txt
> [spark@c6401 ~]$ ll /tmp/
> -rw-r--r-- 1 spark hadoop 516079616 Feb 13 11:10 yes.txt
> [spark@c6401 ~]$ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/
> [spark@c6401 ~]$ pyspark
> {code}
> Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
> 17/02/13 11:12:36 INFO SparkContext: Running Spark version 1.6.2
> {code}
> >>> a = sc.textFile("/tmp/yes.txt")
> {code}
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 341.1 KB, free 341.1 KB)
> 17/02/13 11:12:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 28.3 KB, free 369.4
KB) > 17/02/13 11:12:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on localhost:43389 (size: 28.3 KB, free: 517.4 MB) > 17/02/13 11:12:58 INFO SparkContext: Created broadcast 0 from textFile at > NativeMethodAccessorImpl.java:-2 > {code} > >>> a.count() > {code} > 17/02/13 11:13:03 INFO FileInputFormat: Total input paths to process : 1 > 17/02/13 11:13:03 INFO SparkContext: Starting job: count at :1 > 17/02/13 11:13:03 INFO DAGScheduler: Got job 0 (count at :1) with 4 > output partitions > 17/02/13 11:13:03 INFO DAGScheduler: Final stage: ResultStage 0 (count at > :1) > 17/02/13 11:13:03 INFO DAGScheduler: Parents of final stage: List() > 17/02/13 11:13:03 INFO DAGScheduler: Missing parents: List() > 17/02/13 11:13:03 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] > at count at :1), which has no missing parents > 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1 stored as values in > memory (estimated size 5.7 KB, free 375.1 KB) > 17/02/13 11:13:03 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes > in memory (estimated size 3.5 KB, free 378.6 KB) > 17/02/13 11:13:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory > on localhost:43389 (size: 3.5 KB, free: 517.4 MB) > 17/02/13 11:13:03 INFO SparkContext: Created broadcast 1 from broadcast at > DAGScheduler.scala:1008 > 17/02/13 11:13:03 INFO DAGScheduler: Submitting 4 missing tasks from > ResultStage 0 (PythonRDD[2] at count at :1) > 17/02/13 11:13:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks > 17/02/13 11:13:03 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, > localhost, partition 0,ANY, 2149 bytes) > 17/02/13 11:13:03 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) > 17/02/13 11:13:03 INFO HadoopRDD: Input split: > hdfs://c6401.ambari.apache.org:8020/tmp/yes.txt:0+134217728 > 17/02/13 11:13:03 INFO deprecation: mapred.tip.id is deprecated. 
Instead, use > mapreduce.task.id > 17/02/13 11:13:03 INFO deprecation: mapred.task.id is deprecated. Instead, > use mapreduce.task.attempt.id > 17/02/13 11:13:03 INFO deprecation: mapred.task.is.map is deprecated. > Instead, use mapreduce.task.ismap > 17/02/13 11:13:03 INFO deprecation: mapred.task.partition is deprecated. > Instead, use mapreduce.task.partition > 17/02/13 11:13:03 INFO deprecation: mapred.job.id is deprecated. Instead, use > mapreduce.job.id > 17/02/13 11:16:37 INFO PythonRunner: Times: total = 213335, boot = 317, init > = 445, finish = 212573 > 17/02/13 11:16:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2182 > bytes result sent to driver > 17/02/13 11:16:37 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, > localhost, partition 1,ANY, 2149 bytes) > 17/02/13 11:16:37 INFO Executor: Running task 1.0 in stage 0.0
[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)
[ https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896741#comment-15896741 ] Tathagata Das commented on SPARK-19067: --- Hey [~amitsela] Apologies for not noticing this comment earlier. I am in the process of adding timeouts to mapGroupsWithState. Will update the JIRA soon. Let me know how much that helps.
> mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Michael Armbrust
> Assignee: Tathagata Das
> Priority: Critical
>
> Right now the only way to do stateful operations is with an Aggregator or UDAF. However, this does not give users control over emission or expiration of state, making it hard to implement things like sessionization. We should add a more general construct (probably similar to {{DStream.mapWithState}}) to structured streaming. Here is the design.
> *Requirements*
> - Users should be able to specify a function that can do the following
> - Access the input row corresponding to a key
> - Access the previous state corresponding to a key
> - Optionally, update or remove the state
> - Output any number of new rows (or none at all)
> *Proposed API*
> {code}
> // New methods on KeyValueGroupedDataset
> class KeyValueGroupedDataset[K, V] {
>   // Scala friendly
>   def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => U)
>   def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => Iterator[U])
>   // Java friendly
>   def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, R], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
>   def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, R], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
> }
> // --- New Java-friendly function classes ---
> public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   R call(K key, Iterator<V> values, State<S> state) throws Exception;
> }
> public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   Iterator<R> call(K key, Iterator<V> values, State<S> state) throws Exception;
> }
> // -- Wrapper class for state data --
> trait KeyedState[S] {
>   def exists(): Boolean
>   def get(): S                  // throws an exception if the state does not exist
>   def getOption(): Option[S]
>   def update(newState: S): Unit
>   def remove(): Unit            // exists() will be false after this
> }
> {code}
> Key semantics of the State class
> - The state can be null.
> - If state.remove() is called, then state.exists() will return false, and getOption will return None.
> - After state.update(newState) is called, state.exists() will return true, and getOption will return Some(...).
> - None of the operations are thread-safe. This is to avoid memory barriers.
> *Usage*
> {code}
> val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
>   val newCount = words.size + runningCount.getOption.getOrElse(0L)
>   runningCount.update(newCount)
>   (word, newCount)
> }
> dataset                                                // type is Dataset[String]
>   .groupByKey[String](w => w)                          // generates KeyValueGroupedDataset[String, String]
>   .mapGroupsWithState[Long, (String, Long)](stateFunc) // returns Dataset[(String, Long)]
> {code}
> *Future Directions*
> - Timeout based state expiration (for state that has not received data for a while)
> - General expression based expiration
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
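The KeyedState contract listed in the design (exists/get/getOption/update/remove and their interactions) can be made concrete with a small Python stand-in. This is a toy mirror of the semantics only, not Spark's implementation; names are chosen to match the trait above:

```python
class KeyedState:
    """Toy single-key state holder following the KeyedState contract above."""
    _UNSET = object()  # sentinel so that None remains a legal stored value

    def __init__(self):
        self._value = KeyedState._UNSET

    def exists(self):
        return self._value is not KeyedState._UNSET

    def get(self):
        if not self.exists():
            raise ValueError("state does not exist")  # get() throws when unset
        return self._value

    def get_option(self):
        return self._value if self.exists() else None  # None plays Option.None

    def update(self, new_state):
        self._value = new_state        # exists() becomes True afterwards

    def remove(self):
        self._value = KeyedState._UNSET  # exists() becomes False afterwards


# The running-count function from the usage section, per key:
def state_func(word, words, running_count):
    new_count = len(words) + (running_count.get_option() or 0)
    running_count.update(new_count)
    return (word, new_count)
```

Calling `state_func("spark", ["spark", "spark"], KeyedState())` yields `("spark", 2)`, and a second batch for the same key would add onto the stored count — the essence of sessionization-style stateful aggregation.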
[jira] [Assigned] (SPARK-19830) Add parseTableSchema API to ParserInterface
[ https://issues.apache.org/jira/browse/SPARK-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19830: Assignee: Apache Spark (was: Xiao Li) > Add parseTableSchema API to ParserInterface > --- > > Key: SPARK-19830 > URL: https://issues.apache.org/jira/browse/SPARK-19830 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Specifying the table schema in DDL formats is needed for different scenarios. > For example, specifying the schema in SQL function {{from_json}}, and > specifying the customized JDBC data types. In the submitted PRs, the idea is > to ask users to specify the table schema in the JSON format. This is not user > friendly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19830) Add parseTableSchema API to ParserInterface
[ https://issues.apache.org/jira/browse/SPARK-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19830: Assignee: Xiao Li (was: Apache Spark) > Add parseTableSchema API to ParserInterface > --- > > Key: SPARK-19830 > URL: https://issues.apache.org/jira/browse/SPARK-19830 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Specifying the table schema in DDL formats is needed for different scenarios. > For example, specifying the schema in SQL function {{from_json}}, and > specifying the customized JDBC data types. In the submitted PRs, the idea is > to ask users to specify the table schema in the JSON format. This is not user > friendly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19830) Add parseTableSchema API to ParserInterface
[ https://issues.apache.org/jira/browse/SPARK-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896736#comment-15896736 ] Apache Spark commented on SPARK-19830: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17171 > Add parseTableSchema API to ParserInterface > --- > > Key: SPARK-19830 > URL: https://issues.apache.org/jira/browse/SPARK-19830 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Specifying the table schema in DDL formats is needed for different scenarios. > For example, specifying the schema in SQL function {{from_json}}, and > specifying the customized JDBC data types. In the submitted PRs, the idea is > to ask users to specify the table schema in the JSON format. This is not user > friendly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19830) Add parseTableSchema API to ParserInterface
[ https://issues.apache.org/jira/browse/SPARK-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19830: Description: Specifying the table schema in DDL formats is needed for different scenarios. For example, specifying the schema in SQL function {{from_json}}, and specifying the customized JDBC data types. In the submitted PRs, the idea is to ask users to specify the table schema in the JSON format. This is not user friendly. was: Specifying the table schema in DDL formats is needed for different scenarios. For example, specifying the schema in SQL function {from_json}, and specifying the customized JDBC data types. In the submitted PRs, the idea is to ask users to specify the table schema in the JSON format. This is not user friendly. > Add parseTableSchema API to ParserInterface > --- > > Key: SPARK-19830 > URL: https://issues.apache.org/jira/browse/SPARK-19830 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Specifying the table schema in DDL formats is needed for different scenarios. > For example, specifying the schema in SQL function {{from_json}}, and > specifying the customized JDBC data types. In the submitted PRs, the idea is > to ask users to specify the table schema in the JSON format. This is not user > friendly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19830) Add parseTableSchema API to ParserInterface
Xiao Li created SPARK-19830: --- Summary: Add parseTableSchema API to ParserInterface Key: SPARK-19830 URL: https://issues.apache.org/jira/browse/SPARK-19830 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li Assignee: Xiao Li Specifying the table schema in DDL formats is needed for different scenarios. For example, specifying the schema in SQL function {from_json}, and specifying the customized JDBC data types. In the submitted PRs, the idea is to ask users to specify the table schema in the JSON format. This is not user friendly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
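The DDL format this issue argues for is the familiar comma-separated `name TYPE` list (e.g. `a INT, b STRING`) rather than the verbose JSON StructType encoding. A toy Python parser below illustrates only the shape of that input; Spark's actual `parseTableSchema` is implemented against the SQL grammar and also handles nested types, backquoted names, and comments:

```python
def parse_table_schema(ddl):
    """Parse a flat 'name TYPE, name TYPE' DDL string into (name, type) pairs.

    Toy illustration only: no nested types (ARRAY<...>, MAP<...>), no
    backquoted column names -- those require a real grammar.
    """
    fields = []
    for field in ddl.split(","):
        name, _, dtype = field.strip().partition(" ")
        if not name or not dtype:
            raise ValueError(f"bad field: {field!r}")
        fields.append((name, dtype.strip().upper()))
    return fields

# parse_table_schema("id INT, name STRING") -> [("id", "INT"), ("name", "STRING")]
```

The point of the JIRA is exactly this usability gap: `"id INT, name STRING"` is something a user can type inline (e.g. as the schema argument of `from_json`), while the equivalent JSON StructType document is not.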
[jira] [Closed] (SPARK-19815) Not orderable should be applied to right key instead of left key
[ https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li closed SPARK-19815. --- Resolution: Won't Fix > Not orderable should be applied to right key instead of left key > > > Key: SPARK-19815 > URL: https://issues.apache.org/jira/browse/SPARK-19815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Priority: Minor > > When generating ShuffledHashJoinExec, the orderable condition should be > applied to right key instead of left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19829) The log about driver should support rolling like executor
[ https://issues.apache.org/jira/browse/SPARK-19829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hustfxj updated SPARK-19829: Description: We should rollback the log of the driver , or the log maybe large!!! {code:title=DriverRunner.java|borderStyle=solid} // modify the runDriver private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = { builder.directory(baseDir) def initialize(process: Process): Unit = { // Redirect stdout and stderr to files-- the old code // val stdout = new File(baseDir, "stdout") // CommandUtils.redirectStream(process.getInputStream, stdout) // // val stderr = new File(baseDir, "stderr") // val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"") // val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40) // Files.append(header, stderr, StandardCharsets.UTF_8) // CommandUtils.redirectStream(process.getErrorStream, stderr) // Redirect its stdout and stderr to files-support rolling val stdout = new File(baseDir, "stdout") stdoutAppender = FileAppender(process.getInputStream, stdout, conf) val stderr = new File(baseDir, "stderr") val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"") val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40) Files.append(header, stderr, StandardCharsets.UTF_8) stderrAppender = FileAppender(process.getErrorStream, stderr, conf) } runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise) } {code} was: We should rollback the log of the driver , or the log maybe large!!! 
{code:title=Bar.java|borderStyle=solid} // modify the runDriver private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = { builder.directory(baseDir) def initialize(process: Process): Unit = { // Redirect stdout and stderr to files-- the old code // val stdout = new File(baseDir, "stdout") // CommandUtils.redirectStream(process.getInputStream, stdout) // // val stderr = new File(baseDir, "stderr") // val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"") // val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40) // Files.append(header, stderr, StandardCharsets.UTF_8) // CommandUtils.redirectStream(process.getErrorStream, stderr) // Redirect its stdout and stderr to files-support rolling val stdout = new File(baseDir, "stdout") stdoutAppender = FileAppender(process.getInputStream, stdout, conf) val stderr = new File(baseDir, "stderr") val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"") val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40) Files.append(header, stderr, StandardCharsets.UTF_8) stderrAppender = FileAppender(process.getErrorStream, stderr, conf) } runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise) } {code} > The log about driver should support rolling like executor > - > > Key: SPARK-19829 > URL: https://issues.apache.org/jira/browse/SPARK-19829 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: hustfxj > > We should rollback the log of the driver , or the log maybe large!!! 
> {code:title=DriverRunner.java|borderStyle=solid}
> // modify the runDriver
> private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
>   builder.directory(baseDir)
>   def initialize(process: Process): Unit = {
>     // Redirect stdout and stderr to files -- the old code
>     // val stdout = new File(baseDir, "stdout")
>     // CommandUtils.redirectStream(process.getInputStream, stdout)
>     //
>     // val stderr = new File(baseDir, "stderr")
>     // val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
>     // val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
>     // Files.append(header, stderr, StandardCharsets.UTF_8)
>     // CommandUtils.redirectStream(process.getErrorStream, stderr)
>
>     // Redirect its stdout and stderr to files -- support rolling
>     val stdout = new File(baseDir, "stdout")
>     stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
>     val stderr = new File(baseDir, "stderr")
>     val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
>     val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
>     Files.append(header, stderr, StandardCharsets.UTF_8)
>     stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
>   }
>   runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
> }
> {code}
[jira] [Created] (SPARK-19829) The log about driver should support rolling like executor
hustfxj created SPARK-19829: --- Summary: The log about driver should support rolling like executor Key: SPARK-19829 URL: https://issues.apache.org/jira/browse/SPARK-19829 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.0.0 Reporter: hustfxj
We should roll the driver's log the way executor logs are rolled; otherwise it may grow very large.
{code:title=Bar.java|borderStyle=solid}
// modify the runDriver
private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
  builder.directory(baseDir)
  def initialize(process: Process): Unit = {
    // Redirect stdout and stderr to files -- the old code
    // val stdout = new File(baseDir, "stdout")
    // CommandUtils.redirectStream(process.getInputStream, stdout)
    //
    // val stderr = new File(baseDir, "stderr")
    // val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
    // val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
    // Files.append(header, stderr, StandardCharsets.UTF_8)
    // CommandUtils.redirectStream(process.getErrorStream, stderr)

    // Redirect its stdout and stderr to files -- support rolling
    val stdout = new File(baseDir, "stdout")
    stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
    val stderr = new File(baseDir, "stderr")
    val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
    val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
    Files.append(header, stderr, StandardCharsets.UTF_8)
    stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
  }
  runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
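The Scala change above wires the driver's stdout/stderr through Spark's `FileAppender` so they roll like executor logs instead of growing without bound. The same size-capped rolling idea can be shown with the Python standard library's `RotatingFileHandler`; this is an analogy to the technique, not Spark code:

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

def make_rolling_logger(log_dir, max_bytes=1024, backups=3):
    """Logger whose file rolls over at max_bytes, keeping `backups` old files."""
    handler = RotatingFileHandler(
        os.path.join(log_dir, "stderr"), maxBytes=max_bytes, backupCount=backups)
    logger = logging.getLogger("driver")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger, handler

with tempfile.TemporaryDirectory() as d:
    logger, handler = make_rolling_logger(d)
    for _ in range(200):
        logger.info("x" * 64)          # enough output to force several rollovers
    handler.close()
    rolled = sorted(os.listdir(d))     # stderr plus stderr.1 .. stderr.3
```

The oldest output is discarded once `backupCount` files exist, which is exactly the bound on disk usage the JIRA asks for; Spark's `FileAppender` additionally supports time-based rolling via configuration.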
[jira] [Commented] (SPARK-19825) spark.ml R API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896665#comment-15896665 ] Apache Spark commented on SPARK-19825: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17170 > spark.ml R API for FPGrowth > --- > > Key: SPARK-19825 > URL: https://issues.apache.org/jira/browse/SPARK-19825 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19825) spark.ml R API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19825: Assignee: Apache Spark > spark.ml R API for FPGrowth > --- > > Key: SPARK-19825 > URL: https://issues.apache.org/jira/browse/SPARK-19825 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19825) spark.ml R API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19825: Assignee: (was: Apache Spark) > spark.ml R API for FPGrowth > --- > > Key: SPARK-19825 > URL: https://issues.apache.org/jira/browse/SPARK-19825 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896652#comment-15896652 ] DjvuLee commented on SPARK-18085: - "A separate jar file" means generating a new jar file for the history function, just as Spark put the `network` functionality into its own jar file rather than into Spark core; I just do not want to impact the existing jar files. Please update this JIRA when the design is complete; many people will want to try it. Thanks for your reply!
> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, Web UI
> Affects Versions: 2.0.0
> Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
> It's a known fact that the History Server currently has some annoying issues when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues.
> I'll be attaching a document shortly describing the issues and suggesting a path to how to solve them.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19822) CheckpointSuite.testCheckpointedOperation: should not check checkpointFilesOfLatestTime by the PATH string.
[ https://issues.apache.org/jira/browse/SPARK-19822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19822. -- Resolution: Fixed Assignee: Genmao Yu Fix Version/s: 2.2.0 2.1.1 2.0.3 > CheckpointSuite.testCheckpointedOperation: should not check > checkpointFilesOfLatestTime by the PATH string. > --- > > Key: SPARK-19822 > URL: https://issues.apache.org/jira/browse/SPARK-19822 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Assignee: Genmao Yu >Priority: Minor > Fix For: 2.0.3, 2.1.1, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19701) the `in` operator in pyspark is broken
[ https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19701: --- Assignee: Hyukjin Kwon > the `in` operator in pyspark is broken > -- > > Key: SPARK-19701 > URL: https://issues.apache.org/jira/browse/SPARK-19701 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Hyukjin Kwon > Fix For: 2.2.0 > > > {code} > >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md") > >>> linesWithSpark = textFile.filter("Spark" in textFile.value) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, > in __nonzero__ > raise ValueError("Cannot convert column into bool: please use '&' for > 'and', '|' for 'or', " > ValueError: Cannot convert column into bool: please use '&' for 'and', '|' > for 'or', '~' for 'not' when building DataFrame boolean expressions. > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19701) the `in` operator in pyspark is broken
[ https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19701. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17160 [https://github.com/apache/spark/pull/17160] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
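[editor's note] The traceback in SPARK-19701 comes from Python's `in` protocol: `textFile.value.__contains__("Spark")` returns a new Column expression, and `in` then forces that expression through `bool()`, which pyspark's Column deliberately rejects. A minimal plain-Python sketch of the mechanism (a toy stand-in, not the real pyspark class):

```python
class Column:
    """Toy stand-in for pyspark.sql.Column (assumption: the real class
    behaves analogously for this demonstration)."""

    def __contains__(self, item):
        # Like pyspark, return an expression object instead of True/False.
        return Column()

    def __bool__(self):
        # `x in col` coerces the returned expression to bool, landing here.
        raise ValueError("Cannot convert column into bool: please use '&' for "
                         "'and', '|' for 'or', '~' for 'not'")

    __nonzero__ = __bool__  # same hook under Python 2, as in the traceback


col = Column()
try:
    "Spark" in col  # raises: the `in` result cannot be coerced to bool
except ValueError as exc:
    print("raises:", exc)
```

The working spelling keeps the expression unevaluated, e.g. `textFile.filter(textFile.value.contains("Spark"))`.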
[jira] [Resolved] (SPARK-19535) ALSModel recommendAll analogs
[ https://issues.apache.org/jira/browse/SPARK-19535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19535. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17090 [https://github.com/apache/spark/pull/17090] > ALSModel recommendAll analogs > - > > Key: SPARK-19535 > URL: https://issues.apache.org/jira/browse/SPARK-19535 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Sue Ann Hong > Fix For: 2.2.0 > > > Add methods analogous to the spark.mllib MatrixFactorizationModel methods > recommendProductsForUsers/UsersForProducts. > The initial implementation should be very simple, using DataFrame joins. > Future work can add optimizations. > I recommend naming them: > * recommendForAllUsers > * recommendForAllItems -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
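[editor's note] The methods proposed in SPARK-19535 boil down to scoring every (user, item) pair from the two factor matrices and keeping the top-k items per user; the ticket suggests a simple DataFrame-join implementation first. A plain-Python sketch of the semantics (hypothetical factor values, no Spark required):

```python
# Hypothetical 2-dimensional ALS factors for two users and two items.
user_factors = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_factors = {"i1": [0.9, 0.1], "i2": [0.2, 0.8]}


def recommend_for_all_users(k):
    """Score every (user, item) pair by dot product, keep top-k per user."""
    recs = {}
    for user, uf in user_factors.items():
        scored = sorted(
            ((item, sum(a * b for a, b in zip(uf, itf)))
             for item, itf in item_factors.items()),
            key=lambda pair: -pair[1])
        recs[user] = scored[:k]
    return recs


print(recommend_for_all_users(1))  # -> {'u1': [('i1', 0.9)], 'u2': [('i2', 0.8)]}
```

The Spark version distributes the same computation as a join between the user-factor and item-factor DataFrames.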
[jira] [Updated] (SPARK-19828) R to support JSON array in column from_json
[ https://issues.apache.org/jira/browse/SPARK-19828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19828: - Summary: R to support JSON array in column from_json (was: R support JSON array in column from_json) > R to support JSON array in column from_json > --- > > Key: SPARK-19828 > URL: https://issues.apache.org/jira/browse/SPARK-19828 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.2.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19828) R support JSON array in column from_json
[ https://issues.apache.org/jira/browse/SPARK-19828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896622#comment-15896622 ] Felix Cheung commented on SPARK-19828: -- see SPARK-19595 > R support JSON array in column from_json > > > Key: SPARK-19828 > URL: https://issues.apache.org/jira/browse/SPARK-19828 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.2.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19828) R support JSON array in column from_json
Felix Cheung created SPARK-19828: Summary: R support JSON array in column from_json Key: SPARK-19828 URL: https://issues.apache.org/jira/browse/SPARK-19828 Project: Spark Issue Type: Bug Components: SparkR, SQL Affects Versions: 2.2.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896619#comment-15896619 ] Marcelo Vanzin commented on SPARK-18085: Not sure what you mean by "a separate jar file". There's code that I linked to if you want to try. But it's not complete, probably has tons of bugs, and it's nowhere near usable at the moment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table
[ https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-19765: Summary: UNCACHE TABLE should also un-cache all cached plans that refer to this table (was: UNCACHE TABLE should also re-cache all cached plans that refer to this table) > UNCACHE TABLE should also un-cache all cached plans that refer to this table > > > Key: SPARK-19765 > URL: https://issues.apache.org/jira/browse/SPARK-19765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896590#comment-15896590 ] Nira Amit commented on SPARK-19656: --- I will not, but please consider documenting the correct way to work with newAPIHadoopFile in Java. It is not as easy as working with it in Scala and I've been googling this enough to know that it is not clear to many Java developers who try to use it. > Can't load custom type from avro file to RDD with newAPIHadoopFile > -- > > Key: SPARK-19656 > URL: https://issues.apache.org/jira/browse/SPARK-19656 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.2 >Reporter: Nira Amit > > If I understand correctly, in scala it's possible to load custom objects from > avro files to RDDs this way: > {code} > ctx.hadoopFile("/path/to/the/avro/file.avro", > classOf[AvroInputFormat[MyClassInAvroFile]], > classOf[AvroWrapper[MyClassInAvroFile]], > classOf[NullWritable]) > {code} > I'm not a scala developer, so I tried to "translate" this to java as best I > could. I created classes that extend AvroKey and FileInputFormat: > {code} > public static class MyCustomAvroKey extends AvroKey<MyCustomClass>{}; > public static class MyCustomAvroReader extends > AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass>{ > // with my custom schema and all the required methods... > } > public static class MyCustomInputFormat extends > FileInputFormat<MyCustomAvroKey, NullWritable> { > @Override > public RecordReader<MyCustomAvroKey, NullWritable> > createRecordReader(InputSplit inputSplit, TaskAttemptContext > taskAttemptContext) throws IOException, InterruptedException { > return new MyCustomAvroReader(); > } > } > ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records = > sc.newAPIHadoopFile("file:/path/to/datafile.avro", > MyCustomInputFormat.class, MyCustomAvroKey.class, > NullWritable.class, > sc.hadoopConfiguration()); > MyCustomClass first = records.first()._1.datum(); > System.out.println("Got a result, some custom field: " + > first.getSomeCustomField()); > {code} > This compiles fine, but using a debugger I can see that `first._1.datum()` > actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` > instance. > And indeed, when the following line executes: > {code} > MyCustomClass first = records.first()._1.datum(); > {code} > I get an exception: > {code} > java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record > cannot be cast to my.package.containing.MyCustomClass > {code} > Am I doing it wrong? Or is this not possible in Java? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-19656. - -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19595) from_json produces only a single row when input is a json array
[ https://issues.apache.org/jira/browse/SPARK-19595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-19595. - Resolution: Fixed Fix Version/s: 2.2.0 Resolved by https://github.com/apache/spark/pull/16929 > from_json produces only a single row when input is a json array > --- > > Key: SPARK-19595 > URL: https://issues.apache.org/jira/browse/SPARK-19595 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.2.0 > > > Currently, {{from_json}} reads a single row when it is a json array. For > example, > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = StructType(StructField("a", IntegerType) :: Nil) > Seq(("""[{"a": 1}, {"a": > 2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show() > +--------------------+ > |jsontostruct(struct)| > +--------------------+ > |                 [1]| > +--------------------+ > {code} > Maybe we should not support this in that function or it should work like a > generator expression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19595) from_json produces only a single row when input is a json array
[ https://issues.apache.org/jira/browse/SPARK-19595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-19595: --- Assignee: Hyukjin Kwon -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
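[editor's note] The behavior tracked in SPARK-19595 is that `from_json` with a struct schema keeps only the first element of a JSON array input; the fix lets array inputs yield every element as a row. The intended semantics can be sketched in plain Python with the standard `json` module (no Spark needed; names here are illustrative, not Spark's implementation):

```python
import json

payload = '[{"a": 1}, {"a": 2}]'
records = json.loads(payload)

# Pre-fix behavior (struct schema on an array input): one row survives.
as_single_row = records[0] if isinstance(records, list) else records

# Post-fix behavior (array-aware schema): every element becomes a row.
as_rows = records if isinstance(records, list) else [records]

print(as_single_row)  # {'a': 1}
print(as_rows)        # [{'a': 1}, {'a': 2}]
```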
[jira] [Reopened] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-19656: --- -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19656. --- Resolution: Not A Problem -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19656. --- Resolution: Fixed I do not see anything surprising given your description. Please don't reopen this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-19656. - -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896581#comment-15896581 ] Nira Amit commented on SPARK-19656: --- I found a problem in my schema and managed to load my custom type. So the answer to my original question is basically no, there is nothing like {code} ctx.hadoopFile("/path/to/the/avro/file.avro", classOf[AvroInputFormat[MyClassInAvroFile]], classOf[AvroWrapper[MyClassInAvroFile]], classOf[NullWritable]) {code} for loading custom types into RDDs with the Java API. We have to create all the wrapper classes and implement our own RecordReader. I think this should be documented somewhere. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nira Amit updated SPARK-19656: -- Attachment: (was: datum2.png) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nira Amit updated SPARK-19656: -- Comment: was deleted (was: {code} public static class ABgoEventAvroReader extends AvroRecordReaderBase{ static Schema schema; static { try { schema = new Schema.Parser().parse(AvroTest.class.getResourceAsStream("/Abgo.avsc")); } catch (IOException e) { throw new RuntimeException(e); } } /** A reusable object to hold records of the Avro container file. */ private final ABgoEventAvroKey mCurrentRecord; public ABgoEventAvroReader() { super(schema); mCurrentRecord = new ABgoEventAvroKey(); } /** {@inheritDoc} */ @Override public boolean nextKeyValue() throws IOException, InterruptedException { boolean hasNext = super.nextKeyValue(); mCurrentRecord.datum(getCurrentRecord()); return hasNext; } /** {@inheritDoc} */ @Override public ABgoEventAvroKey getCurrentKey() throws IOException, InterruptedException { return mCurrentRecord; } /** {@inheritDoc} */ @Override public NullWritable getCurrentValue() throws IOException, InterruptedException { return NullWritable.get(); } } {code}) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nira Amit updated SPARK-19656: -- Attachment: (was: datum.png) > Can't load custom type from avro file to RDD with newAPIHadoopFile > -- > > Key: SPARK-19656 > URL: https://issues.apache.org/jira/browse/SPARK-19656 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.2 >Reporter: Nira Amit > > If I understand correctly, in scala it's possible to load custom objects from > avro files to RDDs this way: > {code} > ctx.hadoopFile("/path/to/the/avro/file.avro", > classOf[AvroInputFormat[MyClassInAvroFile]], > classOf[AvroWrapper[MyClassInAvroFile]], > classOf[NullWritable]) > {code} > I'm not a scala developer, so I tried to "translate" this to java as best I > could. I created classes that extend AvroKey and FileInputFormat: > {code} > public static class MyCustomAvroKey extends AvroKey{}; > public static class MyCustomAvroReader extends > AvroRecordReaderBase{ > // with my custom schema and all the required methods... > } > public static class MyCustomInputFormat extends > FileInputFormat { > @Override > public RecordReader > createRecordReader(InputSplit inputSplit, TaskAttemptContext > taskAttemptContext) throws IOException, InterruptedException { > return new MyCustomAvroReader(); > } > } > ... > JavaPairRDD records = > sc.newAPIHadoopFile("file:/path/to/datafile.avro", > MyCustomInputFormat.class, MyCustomAvroKey.class, > NullWritable.class, > sc.hadoopConfiguration()); > MyCustomClass first = records.first()._1.datum(); > System.out.println("Got a result, some custom field: " + > first.getSomeCustomField()); > {code} > This compiles fine, but using a debugger I can see that `first._1.datum()` > actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` > instance. 
> And indeed, when the following line executes: > {code} > MyCustomClass first = records.first()._1.datum(); > {code} > I get an exception: > {code} > java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record > cannot be cast to my.package.containing.MyCustomClass > {code} > Am I doing it wrong? Or is this not possible in Java? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nira Amit updated SPARK-19656: -- Comment: was deleted (was: [~srowen] Will you at least consider the possibility that I'm on to a real problem here? I may be wrong of course, but I'm telling you that I've been struggling with this for weeks and that other developers are struggling with the same thing. At the very least it's worthwhile clarifying this issue. I'm attaching a screenshot of what my code returns in runtime, it's definitely the right class. So the problem is NOT with my file reader. My reader extends AvroRecordReaderBase.) > Can't load custom type from avro file to RDD with newAPIHadoopFile > -- > > Key: SPARK-19656 > URL: https://issues.apache.org/jira/browse/SPARK-19656 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.2 >Reporter: Nira Amit > Attachments: datum2.png, datum.png > > > If I understand correctly, in scala it's possible to load custom objects from > avro files to RDDs this way: > {code} > ctx.hadoopFile("/path/to/the/avro/file.avro", > classOf[AvroInputFormat[MyClassInAvroFile]], > classOf[AvroWrapper[MyClassInAvroFile]], > classOf[NullWritable]) > {code} > I'm not a scala developer, so I tried to "translate" this to java as best I > could. I created classes that extend AvroKey and FileInputFormat: > {code} > public static class MyCustomAvroKey extends AvroKey{}; > public static class MyCustomAvroReader extends > AvroRecordReaderBase { > // with my custom schema and all the required methods... > } > public static class MyCustomInputFormat extends > FileInputFormat { > @Override > public RecordReader > createRecordReader(InputSplit inputSplit, TaskAttemptContext > taskAttemptContext) throws IOException, InterruptedException { > return new MyCustomAvroReader(); > } > } > ... 
> JavaPairRDD records = > sc.newAPIHadoopFile("file:/path/to/datafile.avro", > MyCustomInputFormat.class, MyCustomAvroKey.class, > NullWritable.class, > sc.hadoopConfiguration()); > MyCustomClass first = records.first()._1.datum(); > System.out.println("Got a result, some custom field: " + > first.getSomeCustomField()); > {code} > This compiles fine, but using a debugger I can see that `first._1.datum()` > actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` > instance. > And indeed, when the following line executes: > {code} > MyCustomClass first = records.first()._1.datum(); > {code} > I get an exception: > {code} > java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record > cannot be cast to my.package.containing.MyCustomClass > {code} > Am I doing it wrong? Or is this not possible in Java? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19795) R should support column functions to_json, from_json
[ https://issues.apache.org/jira/browse/SPARK-19795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-19795. -- Resolution: Fixed Fix Version/s: 2.2.0 Target Version/s: 2.2.0 > R should support column functions to_json, from_json > > > Key: SPARK-19795 > URL: https://issues.apache.org/jira/browse/SPARK-19795 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.2.0 > > > Particularly since R does not come with built-in support for processing JSON -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19827: - Shepherd: Felix Cheung > spark.ml R API for PIC > -- > > Key: SPARK-19827 > URL: https://issues.apache.org/jira/browse/SPARK-19827 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19825) spark.ml R API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19825: - Shepherd: Felix Cheung > spark.ml R API for FPGrowth > --- > > Key: SPARK-19825 > URL: https://issues.apache.org/jira/browse/SPARK-19825 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19827) spark.ml R API for PIC
Felix Cheung created SPARK-19827: Summary: spark.ml R API for PIC Key: SPARK-19827 URL: https://issues.apache.org/jira/browse/SPARK-19827 Project: Spark Issue Type: Sub-task Components: ML, SparkR Affects Versions: 2.1.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19826) spark.ml Python API for PIC
Felix Cheung created SPARK-19826: Summary: spark.ml Python API for PIC Key: SPARK-19826 URL: https://issues.apache.org/jira/browse/SPARK-19826 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 2.1.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19825) spark.ml R API for FPGrowth
Felix Cheung created SPARK-19825: Summary: spark.ml R API for FPGrowth Key: SPARK-19825 URL: https://issues.apache.org/jira/browse/SPARK-19825 Project: Spark Issue Type: Sub-task Components: ML, SparkR Affects Versions: 2.1.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nira Amit updated SPARK-19656: -- Attachment: datum2.png > Can't load custom type from avro file to RDD with newAPIHadoopFile > -- > > Key: SPARK-19656 > URL: https://issues.apache.org/jira/browse/SPARK-19656 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.2 >Reporter: Nira Amit > Attachments: datum2.png, datum.png > > > If I understand correctly, in scala it's possible to load custom objects from > avro files to RDDs this way: > {code} > ctx.hadoopFile("/path/to/the/avro/file.avro", > classOf[AvroInputFormat[MyClassInAvroFile]], > classOf[AvroWrapper[MyClassInAvroFile]], > classOf[NullWritable]) > {code} > I'm not a scala developer, so I tried to "translate" this to java as best I > could. I created classes that extend AvroKey and FileInputFormat: > {code} > public static class MyCustomAvroKey extends AvroKey{}; > public static class MyCustomAvroReader extends > AvroRecordReaderBase{ > // with my custom schema and all the required methods... > } > public static class MyCustomInputFormat extends > FileInputFormat { > @Override > public RecordReader > createRecordReader(InputSplit inputSplit, TaskAttemptContext > taskAttemptContext) throws IOException, InterruptedException { > return new MyCustomAvroReader(); > } > } > ... > JavaPairRDD records = > sc.newAPIHadoopFile("file:/path/to/datafile.avro", > MyCustomInputFormat.class, MyCustomAvroKey.class, > NullWritable.class, > sc.hadoopConfiguration()); > MyCustomClass first = records.first()._1.datum(); > System.out.println("Got a result, some custom field: " + > first.getSomeCustomField()); > {code} > This compiles fine, but using a debugger I can see that `first._1.datum()` > actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` > instance. 
> And indeed, when the following line executes: > {code} > MyCustomClass first = records.first()._1.datum(); > {code} > I get an exception: > {code} > java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record > cannot be cast to my.package.containing.MyCustomClass > {code} > Am I doing it wrong? Or is this not possible in Java? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nira Amit updated SPARK-19656: -- Attachment: datum.png {code} public static class ABgoEventAvroReader extends AvroRecordReaderBase{ static Schema schema; static { try { schema = new Schema.Parser().parse(AvroTest.class.getResourceAsStream("/Abgo.avsc")); } catch (IOException e) { throw new RuntimeException(e); } } /** A reusable object to hold records of the Avro container file. */ private final ABgoEventAvroKey mCurrentRecord; public ABgoEventAvroReader() { super(schema); mCurrentRecord = new ABgoEventAvroKey(); } /** {@inheritDoc} */ @Override public boolean nextKeyValue() throws IOException, InterruptedException { boolean hasNext = super.nextKeyValue(); mCurrentRecord.datum(getCurrentRecord()); return hasNext; } /** {@inheritDoc} */ @Override public ABgoEventAvroKey getCurrentKey() throws IOException, InterruptedException { return mCurrentRecord; } /** {@inheritDoc} */ @Override public NullWritable getCurrentValue() throws IOException, InterruptedException { return NullWritable.get(); } } {code} > Can't load custom type from avro file to RDD with newAPIHadoopFile > -- > > Key: SPARK-19656 > URL: https://issues.apache.org/jira/browse/SPARK-19656 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.2 >Reporter: Nira Amit > Attachments: datum.png > > > If I understand correctly, in scala it's possible to load custom objects from > avro files to RDDs this way: > {code} > ctx.hadoopFile("/path/to/the/avro/file.avro", > classOf[AvroInputFormat[MyClassInAvroFile]], > classOf[AvroWrapper[MyClassInAvroFile]], > classOf[NullWritable]) > {code} > I'm not a scala developer, so I tried to "translate" this to java as best I > could. 
I created classes that extend AvroKey and FileInputFormat: > {code} > public static class MyCustomAvroKey extends AvroKey{}; > public static class MyCustomAvroReader extends > AvroRecordReaderBase { > // with my custom schema and all the required methods... > } > public static class MyCustomInputFormat extends > FileInputFormat { > @Override > public RecordReader > createRecordReader(InputSplit inputSplit, TaskAttemptContext > taskAttemptContext) throws IOException, InterruptedException { > return new MyCustomAvroReader(); > } > } > ... > JavaPairRDD records = > sc.newAPIHadoopFile("file:/path/to/datafile.avro", > MyCustomInputFormat.class, MyCustomAvroKey.class, > NullWritable.class, > sc.hadoopConfiguration()); > MyCustomClass first = records.first()._1.datum(); > System.out.println("Got a result, some custom field: " + > first.getSomeCustomField()); > {code} > This compiles fine, but using a debugger I can see that `first._1.datum()` > actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` > instance. > And indeed, when the following line executes: > {code} > MyCustomClass first = records.first()._1.datum(); > {code} > I get an exception: > {code} > java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record > cannot be cast to my.package.containing.MyCustomClass > {code} > Am I doing it wrong? Or is this not possible in Java? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nira Amit reopened SPARK-19656: --- [~srowen] Will you at least consider the possibility that I'm on to a real problem here? I may be wrong of course, but I'm telling you that I've been struggling with this for weeks and that other developers are struggling with the same thing. At the very least it's worthwhile clarifying this issue. I'm attaching a screenshot of what my code returns in runtime, it's definitely the right class. So the problem is NOT with my file reader. My reader extends AvroRecordReaderBase. > Can't load custom type from avro file to RDD with newAPIHadoopFile > -- > > Key: SPARK-19656 > URL: https://issues.apache.org/jira/browse/SPARK-19656 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.2 >Reporter: Nira Amit > > If I understand correctly, in scala it's possible to load custom objects from > avro files to RDDs this way: > {code} > ctx.hadoopFile("/path/to/the/avro/file.avro", > classOf[AvroInputFormat[MyClassInAvroFile]], > classOf[AvroWrapper[MyClassInAvroFile]], > classOf[NullWritable]) > {code} > I'm not a scala developer, so I tried to "translate" this to java as best I > could. I created classes that extend AvroKey and FileInputFormat: > {code} > public static class MyCustomAvroKey extends AvroKey{}; > public static class MyCustomAvroReader extends > AvroRecordReaderBase { > // with my custom schema and all the required methods... > } > public static class MyCustomInputFormat extends > FileInputFormat { > @Override > public RecordReader > createRecordReader(InputSplit inputSplit, TaskAttemptContext > taskAttemptContext) throws IOException, InterruptedException { > return new MyCustomAvroReader(); > } > } > ... 
> JavaPairRDD records = > sc.newAPIHadoopFile("file:/path/to/datafile.avro", > MyCustomInputFormat.class, MyCustomAvroKey.class, > NullWritable.class, > sc.hadoopConfiguration()); > MyCustomClass first = records.first()._1.datum(); > System.out.println("Got a result, some custom field: " + > first.getSomeCustomField()); > {code} > This compiles fine, but using a debugger I can see that `first._1.datum()` > actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` > instance. > And indeed, when the following line executes: > {code} > MyCustomClass first = records.first()._1.datum(); > {code} > I get an exception: > {code} > java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record > cannot be cast to my.package.containing.MyCustomClass > {code} > Am I doing it wrong? Or is this not possible in Java? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896452#comment-15896452 ] Kazuaki Ishizaki commented on SPARK-14083: -- Does anyone go forward with this? If not, I will continue to work for this. Recently, I noticed that Spark already uses ASM framework that provides similar features to Javassist. > Analyze JVM bytecode and turn closures into Catalyst expressions > > > Key: SPARK-14083 > URL: https://issues.apache.org/jira/browse/SPARK-14083 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > One big advantage of the Dataset API is the type safety, at the cost of > performance due to heavy reliance on user-defined closures/lambdas. These > closures are typically slower than expressions because we have more > flexibility to optimize expressions (known data types, no virtual function > calls, etc). In many cases, it's actually not going to be very difficult to > look into the byte code of these closures and figure out what they are trying > to do. If we can understand them, then we can turn them directly into > Catalyst expressions for more optimized executions. > Some examples are: > {code} > df.map(_.name) // equivalent to expression col("name") > ds.groupBy(_.gender) // equivalent to expression col("gender") > df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), > lit(18) > df.map(_.id + 1) // equivalent to Add(col("age"), lit(1)) > {code} > The goal of this ticket is to design a small framework for byte code analysis > and use that to convert closures/lambdas into Catalyst expressions in order > to speed up Dataset execution. It is a little bit futuristic, but I believe > it is very doable. The framework should be easy to reason about (e.g. similar > to Catalyst). > Note that a big emphasis on "small" and "easy to reason about". A patch > should be rejected if it is too complicated or difficult to reason about. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
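[Editor's note] SPARK-14083 proposes turning closure bytecode into Catalyst expression trees such as {{GreaterThan(col("age"), lit(18))}}. The following standalone Java sketch (all class names are mine, purely illustrative, not Spark's actual Catalyst classes) shows the *target* representation such an analyzer would emit for a closure like {{_.age > 18}}:

```java
// A minimal expression-tree sketch of what a bytecode analyzer could emit
// for the closure `df.filter(_.age > 18)` mentioned in the ticket.
// These are illustrative stand-ins, NOT Spark's real Catalyst classes.
interface Expr {}

final class Col implements Expr {
    final String name;
    Col(String name) { this.name = name; }
    @Override public String toString() { return "col(\"" + name + "\")"; }
}

final class Lit implements Expr {
    final Object value;
    Lit(Object value) { this.value = value; }
    @Override public String toString() { return "lit(" + value + ")"; }
}

final class GreaterThan implements Expr {
    final Expr left, right;
    GreaterThan(Expr left, Expr right) { this.left = left; this.right = right; }
    @Override public String toString() {
        return "GreaterThan(" + left + ", " + right + ")";
    }
}

public class ClosureToExprDemo {
    public static void main(String[] args) {
        // The analyzer's job: recognize the load-field / load-constant /
        // compare pattern in the closure's bytecode and build this tree.
        Expr e = new GreaterThan(new Col("age"), new Lit(18));
        System.out.println(e);
    }
}
```

Once closures are in this form, the optimizer can reason about them like any other expression (push filters down, avoid virtual calls), which is the performance win the ticket describes.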
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896343#comment-15896343 ] Sean Owen commented on SPARK-19656: --- It accepts it because you tell it that's what the InputFormat will return, but it doesn't. The Class arg is there just for its compile-time type. That doesn't make it so and it doesn't have a way of verifying it's what your InputFormat returns. newAPIHadoopFile doesn't load as anything in particular; the InputFormat does. You are still really talking about Hadoop and Avro APIs. I'm going to leave the conversation there and close this, as this is as much as is reasonable to consider in the context of Spark. This is not a bug as-is. You can take this info to explore how to work with Avro values elsewhere. A JIRA can be reopened if you have a clear and reproducible problem in what Spark is supposed to return or do and what it does. That does require understanding the operation of Hadoop APIs. Questions should stay on the mailing list or SO, if it's still in the realm of "how can I get this to work?" > Can't load custom type from avro file to RDD with newAPIHadoopFile > -- > > Key: SPARK-19656 > URL: https://issues.apache.org/jira/browse/SPARK-19656 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.2 >Reporter: Nira Amit > > If I understand correctly, in scala it's possible to load custom objects from > avro files to RDDs this way: > {code} > ctx.hadoopFile("/path/to/the/avro/file.avro", > classOf[AvroInputFormat[MyClassInAvroFile]], > classOf[AvroWrapper[MyClassInAvroFile]], > classOf[NullWritable]) > {code} > I'm not a scala developer, so I tried to "translate" this to java as best I > could. I created classes that extend AvroKey and FileInputFormat: > {code} > public static class MyCustomAvroKey extends AvroKey{}; > public static class MyCustomAvroReader extends > AvroRecordReaderBase{ > // with my custom schema and all the required methods... 
> } > public static class MyCustomInputFormat extends > FileInputFormat { > @Override > public RecordReader > createRecordReader(InputSplit inputSplit, TaskAttemptContext > taskAttemptContext) throws IOException, InterruptedException { > return new MyCustomAvroReader(); > } > } > ... > JavaPairRDD records = > sc.newAPIHadoopFile("file:/path/to/datafile.avro", > MyCustomInputFormat.class, MyCustomAvroKey.class, > NullWritable.class, > sc.hadoopConfiguration()); > MyCustomClass first = records.first()._1.datum(); > System.out.println("Got a result, some custom field: " + > first.getSomeCustomField()); > {code} > This compiles fine, but using a debugger I can see that `first._1.datum()` > actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` > instance. > And indeed, when the following line executes: > {code} > MyCustomClass first = records.first()._1.datum(); > {code} > I get an exception: > {code} > java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record > cannot be cast to my.package.containing.MyCustomClass > {code} > Am I doing it wrong? Or is this not possible in Java? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
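[Editor's note] The type-erasure behavior described in the comment above ("The Class arg is there just for its compile-time type") can be reproduced without Spark or Avro at all. In this standalone Java sketch (method and class names are mine, purely illustrative), the `Class<T>` token fixes only the compile-time type; the runtime object is whatever the producer actually created, and the `ClassCastException` surfaces only when the value is first assigned to the claimed type — exactly the `GenericData$Record` vs. `MyCustomClass` pattern in this ticket:

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {

    // Analogous to newAPIHadoopFile: the caller passes a Class token, but
    // nothing at runtime checks that the producer really created that type.
    @SuppressWarnings("unchecked")
    static <T> List<T> loadAs(Class<T> claimedClass, Object whatWasActuallyProduced) {
        List<Object> raw = new ArrayList<>();
        raw.add(whatWasActuallyProduced);   // e.g. a GenericData$Record
        return (List<T>) (List<?>) raw;      // unchecked cast: always succeeds here
    }

    static String demo() {
        // We *claim* the list holds Strings, but the producer made an Integer.
        List<String> records = loadAs(String.class, Integer.valueOf(42));
        try {
            String first = records.get(0);   // checkcast inserted here -> CCE
            return "got " + first;
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The `loadAs` call itself never fails; the failure is deferred to the first typed read, which is why the reporter's code compiles cleanly and only breaks at `records.first()._1.datum()`.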
[jira] [Closed] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-19656. - > Can't load custom type from avro file to RDD with newAPIHadoopFile > -- > > Key: SPARK-19656 > URL: https://issues.apache.org/jira/browse/SPARK-19656 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.2 >Reporter: Nira Amit > > If I understand correctly, in scala it's possible to load custom objects from > avro files to RDDs this way: > {code} > ctx.hadoopFile("/path/to/the/avro/file.avro", > classOf[AvroInputFormat[MyClassInAvroFile]], > classOf[AvroWrapper[MyClassInAvroFile]], > classOf[NullWritable]) > {code} > I'm not a scala developer, so I tried to "translate" this to java as best I > could. I created classes that extend AvroKey and FileInputFormat: > {code} > public static class MyCustomAvroKey extends AvroKey{}; > public static class MyCustomAvroReader extends > AvroRecordReaderBase{ > // with my custom schema and all the required methods... > } > public static class MyCustomInputFormat extends > FileInputFormat { > @Override > public RecordReader > createRecordReader(InputSplit inputSplit, TaskAttemptContext > taskAttemptContext) throws IOException, InterruptedException { > return new MyCustomAvroReader(); > } > } > ... > JavaPairRDD records = > sc.newAPIHadoopFile("file:/path/to/datafile.avro", > MyCustomInputFormat.class, MyCustomAvroKey.class, > NullWritable.class, > sc.hadoopConfiguration()); > MyCustomClass first = records.first()._1.datum(); > System.out.println("Got a result, some custom field: " + > first.getSomeCustomField()); > {code} > This compiles fine, but using a debugger I can see that `first._1.datum()` > actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` > instance. 
> And indeed, when the following line executes: > {code} > MyCustomClass first = records.first()._1.datum(); > {code} > I get an exception: > {code} > java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record > cannot be cast to my.package.containing.MyCustomClass > {code} > Am I doing it wrong? Or is this not possible in Java? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896314#comment-15896314 ] Nira Amit commented on SPARK-19656: --- And by the way, "what is in the file" is bytes. The question is what I load these bytes into. I'm trying to load them into a MyCustomClass, apparently what newAPIHadoopFile is loading them into is GenericData$Record. Even though the return type it promises is JavaPairRDD. > Can't load custom type from avro file to RDD with newAPIHadoopFile > -- > > Key: SPARK-19656 > URL: https://issues.apache.org/jira/browse/SPARK-19656 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.2 >Reporter: Nira Amit > > If I understand correctly, in scala it's possible to load custom objects from > avro files to RDDs this way: > {code} > ctx.hadoopFile("/path/to/the/avro/file.avro", > classOf[AvroInputFormat[MyClassInAvroFile]], > classOf[AvroWrapper[MyClassInAvroFile]], > classOf[NullWritable]) > {code} > I'm not a scala developer, so I tried to "translate" this to java as best I > could. I created classes that extend AvroKey and FileInputFormat: > {code} > public static class MyCustomAvroKey extends AvroKey{}; > public static class MyCustomAvroReader extends > AvroRecordReaderBase { > // with my custom schema and all the required methods... > } > public static class MyCustomInputFormat extends > FileInputFormat { > @Override > public RecordReader > createRecordReader(InputSplit inputSplit, TaskAttemptContext > taskAttemptContext) throws IOException, InterruptedException { > return new MyCustomAvroReader(); > } > } > ... 
> JavaPairRDD records = > sc.newAPIHadoopFile("file:/path/to/datafile.avro", > MyCustomInputFormat.class, MyCustomAvroKey.class, > NullWritable.class, > sc.hadoopConfiguration()); > MyCustomClass first = records.first()._1.datum(); > System.out.println("Got a result, some custom field: " + > first.getSomeCustomField()); > {code} > This compiles fine, but using a debugger I can see that `first._1.datum()` > actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` > instance. > And indeed, when the following line executes: > {code} > MyCustomClass first = records.first()._1.datum(); > {code} > I get an exception: > {code} > java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record > cannot be cast to my.package.containing.MyCustomClass > {code} > Am I doing it wrong? Or is this not possible in Java? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19714: Assignee: Apache Spark > Bucketizer Bug Regarding Handling Unbucketed Inputs > --- > > Key: SPARK-19714 > URL: https://issues.apache.org/jira/browse/SPARK-19714 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Bill Chambers >Assignee: Apache Spark > > {code} > contDF = spark.range(500).selectExpr("cast(id as double) as id") > import org.apache.spark.ml.feature.Bucketizer > val splits = Array(5.0, 10.0, 250.0, 500.0) > val bucketer = new Bucketizer() > .setSplits(splits) > .setInputCol("id") > .setHandleInvalid("skip") > bucketer.transform(contDF).show() > {code} > You would expect that this would handle the invalid buckets. However it fails > {code} > Caused by: org.apache.spark.SparkException: Feature value 0.0 out of > Bucketizer bounds [5.0, 500.0]. Check your features, or loosen the > lower/upper bound constraints. > {code} > It seems strange that handleInvalud doesn't actually handleInvalid inputs. > Thoughts anyone? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896290#comment-15896290 ] Apache Spark commented on SPARK-19714: -- User 'wojtek-szymanski' has created a pull request for this issue: https://github.com/apache/spark/pull/17169 > Bucketizer Bug Regarding Handling Unbucketed Inputs > --- > > Key: SPARK-19714 > URL: https://issues.apache.org/jira/browse/SPARK-19714 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Bill Chambers > > {code} > contDF = spark.range(500).selectExpr("cast(id as double) as id") > import org.apache.spark.ml.feature.Bucketizer > val splits = Array(5.0, 10.0, 250.0, 500.0) > val bucketer = new Bucketizer() > .setSplits(splits) > .setInputCol("id") > .setHandleInvalid("skip") > bucketer.transform(contDF).show() > {code} > You would expect that this would handle the invalid buckets. However it fails > {code} > Caused by: org.apache.spark.SparkException: Feature value 0.0 out of > Bucketizer bounds [5.0, 500.0]. Check your features, or loosen the > lower/upper bound constraints. > {code} > It seems strange that handleInvalud doesn't actually handleInvalid inputs. > Thoughts anyone? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19714: Assignee: (was: Apache Spark)
[jira] [Comment Edited] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896286#comment-15896286 ] Nira Amit edited comment on SPARK-19656 at 3/5/17 2:55 PM: --- But then why does the compiler accept what newAPIHadoopFile returns as MyCustomClass? If what you are saying is correct, then the only return type that should be acceptable is a GenericData$Record or something that can be cast to it. > Can't load custom type from avro file to RDD with newAPIHadoopFile > -- > > Key: SPARK-19656 > URL: https://issues.apache.org/jira/browse/SPARK-19656 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.2 >Reporter: Nira Amit > > If I understand correctly, in Scala it's possible to load custom objects from > avro files to RDDs this way: > {code} > ctx.hadoopFile("/path/to/the/avro/file.avro", > classOf[AvroInputFormat[MyClassInAvroFile]], > classOf[AvroWrapper[MyClassInAvroFile]], > classOf[NullWritable]) > {code} > I'm not a Scala developer, so I tried to "translate" this to Java as best I > could. I created classes that extend AvroKey and FileInputFormat: > {code} > public static class MyCustomAvroKey extends AvroKey<MyCustomClass> {}; > public static class MyCustomAvroReader extends > AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> { > // with my custom schema and all the required methods... > } > public static class MyCustomInputFormat extends > FileInputFormat<MyCustomAvroKey, NullWritable> { > @Override > public RecordReader<MyCustomAvroKey, NullWritable> > createRecordReader(InputSplit inputSplit, TaskAttemptContext > taskAttemptContext) throws IOException, InterruptedException { > return new MyCustomAvroReader(); > } > } > ... > JavaPairRDD<MyCustomAvroKey, NullWritable> records = > sc.newAPIHadoopFile("file:/path/to/datafile.avro", > MyCustomInputFormat.class, MyCustomAvroKey.class, > NullWritable.class, > sc.hadoopConfiguration()); > MyCustomClass first = records.first()._1.datum(); > System.out.println("Got a result, some custom field: " + > first.getSomeCustomField()); > {code} > This compiles fine, but using a debugger I can see that `first._1.datum()` > actually returns a `GenericData$Record` at runtime, not a `MyCustomClass` > instance. > And indeed, when the following line executes: > {code} > MyCustomClass first = records.first()._1.datum(); > {code} > I get an exception: > {code} > java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record > cannot be cast to my.package.containing.MyCustomClass > {code} > Am I doing it wrong? Or is this not possible in Java?
[jira] [Comment Edited] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896286#comment-15896286 ] Nira Amit edited comment on SPARK-19656 at 3/5/17 2:56 PM: --- But then why does the compiler accept what newAPIHadoopFile returns as MyCustomClass? If what you are saying is correct, then the only return type that should be accepted is a GenericData$Record or something that can be cast to it.
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896286#comment-15896286 ] Nira Amit commented on SPARK-19656: --- But then why does the compiler accept what newAPIHadoopFile returns as MyCustomClass? If what you are saying is correct, then the only return type that should be acceptable is a GenericData$Record or something that can be cast to it.
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896282#comment-15896282 ] Sean Owen commented on SPARK-19656: --- Well, at the least, I'd suggest posting a more compilable example. But the last point is I think your problem: you are correctly getting a GenericData$Record because that is what is in the file. You need to call methods on that object to get your type object out. That's an Avro usage issue in your code. You need to investigate that before opening a JIRA.
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896277#comment-15896277 ] Nira Amit commented on SPARK-19656: --- The only reason my code sample doesn't compile is that it doesn't include my actual custom class implementation. Otherwise it's a copy-paste of my valid code, which compiles, runs, and then crashes due to a RuntimeException. And that is because the class it's getting at runtime isn't what the compiler gave it. I understand that the solution is to migrate my code to Datasets. But this seems like a problem in the newAPIHadoopFile API.
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896272#comment-15896272 ] Sean Owen commented on SPARK-19656: --- PS I should be concrete about why I think the original code doesn't work -- it doesn't compile because you're using newAPIHadoopFile whereas the example you follow uses hadoopFile. If you adjusted that, then I think you're getting back an Avro GenericRecord as expected. Avro has its own records in a file, not your objects. You need to get() your type out of it? But that's an issue in your code. I think the reason this went to DataFrame / Dataset is that there is first-class support for Avro there where your types get unpacked. That's the righter way to do this anyway, although, shouldn't be much reason you can't do this with RDDs if you must.
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896268#comment-15896268 ] Sean Owen commented on SPARK-19656: --- Yes, I just tried to compile your code example above, and it doesn't work, but, for more basic reasons. That much is "Not A Problem" because you've got more basic usage errors. That is, this is not an example of code that should work but doesn't due to Avro issues. To the additional narrower question of whether Datasets and casting works, it does, and I verified that it compiles.
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896266#comment-15896266 ] Nira Amit commented on SPARK-19656: --- [~sowen] I have been trying this for weeks, every way I could possibly think of. Have you (really) tried any of my code samples? With RDDs, not Datasets? If this is not possible with newAPIHadoopFile, then it's not an issue for mailing lists but rather something that should be mentioned explicitly in the documentation, because apparently I'm not the only one who expected this to work and couldn't figure out what I was doing wrong.
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896262#comment-15896262 ] Nira Amit commented on SPARK-19656: --- Thanks Eric, but my question is about RDDs. Is it correct that it is not possible to load custom classes directly to RDDs in Java? Only to DataFrames?
[jira] [Commented] (SPARK-16102) Use Record API from Univocity rather than current data cast API.
[ https://issues.apache.org/jira/browse/SPARK-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896247#comment-15896247 ] Hyukjin Kwon commented on SPARK-16102: -- This is taking longer than I thought. Let me update this soon. > Use Record API from Univocity rather than current data cast API. > > > Key: SPARK-16102 > URL: https://issues.apache.org/jira/browse/SPARK-16102 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > There is a Record API for the Univocity parser. > This API provides typed data. Spark currently tries to compare and cast each > value. > Using this library should reduce the code in Spark and maybe improve > performance. > It seems a benchmark should be performed first.
[jira] [Resolved] (SPARK-19254) Support Seq, Map, and Struct in functions.lit
[ https://issues.apache.org/jira/browse/SPARK-19254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19254. --- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.2.0 > Support Seq, Map, and Struct in functions.lit > - > > Key: SPARK-19254 > URL: https://issues.apache.org/jira/browse/SPARK-19254 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.2.0 > > > In the current implementation, functions.lit does not support Seq, Map, and > Struct. This ticket is intended to support them. This is a follow-up to > https://issues.apache.org/jira/browse/SPARK-17683.
[jira] [Comment Edited] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896194#comment-15896194 ] Eric Maynard edited comment on SPARK-19656 at 3/5/17 11:51 AM: --- Here is a complete working example in Java: {code:title=AvroTest.java|borderStyle=solid} import org.apache.spark.sql.Row; import org.apache.spark.sql.SaveMode; import org.apache.spark.sql.SparkSession; import java.util.ArrayList; public class AvroTest { public static void main(String[] args){ //build spark session: System.setProperty("hadoop.home.dir", "C:\\Hadoop");//windows hack SparkSession spark = SparkSession.builder().master("local").appName("Avro Test") .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")//another windows hack .getOrCreate(); //create data: ArrayList<CustomClass> list = new ArrayList<>(); CustomClass cc = new CustomClass(); cc.setA(5); cc.setB(6); list.add(cc); spark.createDataFrame(list, CustomClass.class).write().mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("C:\\tmp\\file.avro"); //read data: Row row = (spark.read().format("com.databricks.spark.avro").load("C:\\tmp\\file.avro").head()); System.out.println(row); System.out.println(row.get(0)); System.out.println(row.get(1)); System.out.println("Success =\t" + ((Integer)row.get(0) == 5)); } } {code} With a simple custom class: {code:title=CustomClass.java|borderStyle=solid} import java.io.Serializable; public class CustomClass implements Serializable { private int a; public void setA(int value){this.a = value;} public int getA(){return this.a;} private int b; public void setB(int value) {this.b = value;} public int getB(){return this.b;} } {code} Everything looks ok to me, and after running stdout looks like this: {code} [5,6] 5 6 Success = true {code} In the future please make sure that you don't have an issue in your application before opening a JIRA. Also, as an aside, I really recommend picking up some Scala as IMO the Scala API is much friendlier, esp. around the edges for things like the avro library.
was (Author: emaynard): Here is a complete working example in Java: {code:title=AvroTest.java|borderStyle=solid} import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; import java.util.ArrayList; public class AvroTest { public static void main(String[] args){ //build spark session: System.setProperty("hadoop.home.dir", "C:\\Hadoop");//windows hack SparkSession spark = SparkSession.builder().master("local").appName("Avro Test") .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")//another windows hack .getOrCreate(); //create data: ArrayList list = new ArrayList(); CustomClass cc = new CustomClass(); cc.setValue(5); list.add(cc); spark.createDataFrame(list, CustomClass.class).write().format("com.databricks.spark.avro").save("C:\\tmp\\file.avro"); //read data: Row row = (spark.read().format("com.databricks.spark.avro").load("C:\\tmp\\file.avro").head()); System.out.println("Success =\t" + ((Integer)row.get(0) == 5)); } } {code} With a simple custom class: {code:title=CustomClass.java|borderStyle=solid} import java.io.Serializable; public class CustomClass implements Serializable { public int value; public void setValue(int value){this.value = value;} public int getValue(){return this.value;} } {code} Everything looks ok to me, and the main function prints "Success = true". In the future please make sure that you don't have an issue in your application before opening a JIRA. Also, as an aside, I really recommend picking up some Scala as IMO the Scala API is much friendlier, esp. around the edges for things like the avro library. 
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896194#comment-15896194 ] Eric Maynard commented on SPARK-19656: -- Here is a complete working example in Java:
{code:title=AvroTest.java|borderStyle=solid}
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;

public class AvroTest {
    public static void main(String[] args){
        //build spark session:
        System.setProperty("hadoop.home.dir", "C:\\Hadoop"); //windows hack
        SparkSession spark = SparkSession.builder().master("local").appName("Avro Test")
            .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse") //another windows hack
            .getOrCreate();
        //create data:
        ArrayList<CustomClass> list = new ArrayList<>();
        CustomClass cc = new CustomClass();
        cc.setValue(5);
        list.add(cc);
        spark.createDataFrame(list, CustomClass.class).write().format("com.databricks.spark.avro").save("C:\\tmp\\file.avro");
        //read data:
        Row row = spark.read().format("com.databricks.spark.avro").load("C:\\tmp\\file.avro").head();
        System.out.println("Success =\t" + ((Integer)row.get(0) == 5));
    }
}
{code}
With a simple custom class:
{code:title=CustomClass.java|borderStyle=solid}
import java.io.Serializable;

public class CustomClass implements Serializable {
    public int value;
    public void setValue(int value){this.value = value;}
    public int getValue(){return this.value;}
}
{code}
Everything looks ok to me, and the main function prints "Success = true". In the future please make sure that you don't have an issue in your application before opening a JIRA. Also, as an aside, I really recommend picking up some Scala as IMO the Scala API is much friendlier, esp. around the edges for things like the avro library.
> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
> Issue Type: Question
> Components: Java API
> Affects Versions: 2.0.2
> Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass>{};
> public static class MyCustomAvroReader extends AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
>     // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends FileInputFormat<MyCustomAvroKey, NullWritable> {
>     @Override
>     public RecordReader<MyCustomAvroKey, NullWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
>         return new MyCustomAvroReader();
>     }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>     sc.newAPIHadoopFile("file:/path/to/datafile.avro",
>         MyCustomInputFormat.class, MyCustomAvroKey.class,
>         NullWritable.class,
>         sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` actually returns a `GenericData$Record` at runtime, not a `MyCustomClass` instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong?
Or is this not possible in Java? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19824) Standalone master JSON not showing cores for running applications
[ https://issues.apache.org/jira/browse/SPARK-19824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896193#comment-15896193 ] Sean Owen commented on SPARK-19824: --- I guess it doesn't show "memory per executor" either? That came up yesterday. I don't know the standalone master well but it does look like this is in the UI. I'm not sure it has to be in the JSON but seems reasonable to be consistent. > Standalone master JSON not showing cores for running applications > - > > Key: SPARK-19824 > URL: https://issues.apache.org/jira/browse/SPARK-19824 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.0 >Reporter: Dan >Priority: Minor > > The JSON API of the standalone master ("/json") does not show the number of > cores for a running application, which is available on the UI. > "activeapps" : [ { > "starttime" : 1488702337788, > "id" : "app-20170305102537-19717", > "name" : "POPAI_Aggregated", > "user" : "ibiuser", > "memoryperslave" : 16384, > "submitdate" : "Sun Mar 05 10:25:37 IST 2017", > "state" : "RUNNING", > "duration" : 1141934 > } ],
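The gap is visible in the payload itself: the `activeapps` entry carries `memoryperslave` but no core count, even though the master UI shows cores. A minimal sketch of checking that (a naive substring test rather than a real JSON parser; the JSON is abridged from the report above):

```java
public class MasterJsonCheck {
    // Abridged "activeapps" entry from the standalone master's /json endpoint,
    // as quoted in the report: memoryperslave is there, cores is not.
    static final String ACTIVE_APP_JSON =
        "{ \"starttime\" : 1488702337788,"
      + "  \"id\" : \"app-20170305102537-19717\","
      + "  \"name\" : \"POPAI_Aggregated\","
      + "  \"user\" : \"ibiuser\","
      + "  \"memoryperslave\" : 16384,"
      + "  \"state\" : \"RUNNING\" }";

    // Naive field-presence check: looks for the quoted key anywhere in the text.
    static boolean hasField(String json, String field) {
        return json.contains("\"" + field + "\"");
    }

    public static void main(String[] args) {
        System.out.println("memoryperslave present: " + hasField(ACTIVE_APP_JSON, "memoryperslave"));
        System.out.println("cores present: " + hasField(ACTIVE_APP_JSON, "cores"));
    }
}
```

The UI and the JSON endpoint are rendered from the same application state, so exposing the core count in JSON would only mean adding the field, which is the consistency argument above.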
[jira] [Resolved] (SPARK-19805) Log the row type when query result dose not match
[ https://issues.apache.org/jira/browse/SPARK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19805. --- Resolution: Fixed Assignee: Genmao Yu Fix Version/s: 2.2.0 > Log the row type when query result dose not match > - > > Key: SPARK-19805 > URL: https://issues.apache.org/jira/browse/SPARK-19805 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Assignee: Genmao Yu >Priority: Minor > Fix For: 2.2.0 > >
[jira] [Commented] (SPARK-19823) Support Gang Distribution of Task
[ https://issues.apache.org/jira/browse/SPARK-19823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896192#comment-15896192 ] Sean Owen commented on SPARK-19823: --- I don't think that's quite right. Task assignment takes into account locality, for example. Choosing to override locality preferences of the scheduler and batch-assign tasks to executors that are suboptimal for locality has drawbacks. You will generally observe that executors that are local to data that jobs need will get more tasks, and those that aren't will already be less busy, and therefore more quickly become idle and terminate. The effect you want already takes place indirectly. > Support Gang Distribution of Task > -- > > Key: SPARK-19823 > URL: https://issues.apache.org/jira/browse/SPARK-19823 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 2.0.2 >Reporter: DjvuLee >
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896190#comment-15896190 ] Sean Owen commented on SPARK-6407: -- I did some work on this, but it's not a paper or anything, just some code, in and around these bits of code, which try to compute new user/item updates on the fly: https://github.com/OryxProject/oryx/blob/master/app/oryx-app/src/main/java/com/cloudera/oryx/app/speed/als/ALSSpeedModelManager.java#L198 https://github.com/OryxProject/oryx/blob/master/app/oryx-app-common/src/main/java/com/cloudera/oryx/app/als/ALSUtils.java The choices about the semantics of the updates are in ALSUtils. If you dig into it, we can discuss offline and I can probably write more in the docs to make it clearer what's happening. > Streaming ALS for Collaborative Filtering > - > > Key: SPARK-6407 > URL: https://issues.apache.org/jira/browse/SPARK-6407 > Project: Spark > Issue Type: New Feature > Components: DStreams >Reporter: Felix Cheung >Priority: Minor > > Like MLLib's ALS implementation for recommendation, and applying to streaming. > Similar to streaming linear regression, logistic regression, could we apply > gradient updates to batches of data and reuse existing MLLib implementation? 
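The fold-in update discussed in this thread amounts to a small ridge regression per new user: holding the item-factor matrix Y fixed, solve (YᵀY + λI)u = Yᵀr over the items the user has rated. A minimal sketch for k = 2 latent factors (hand-rolled 2x2 solve via Cramer's rule; the factor values and λ below are made up for illustration, and real implementations like Oryx's ALSUtils handle more than this):

```java
public class FoldInSketch {
    // Fold in a new user against fixed item factors: solve (Y^T Y + lambda*I) u = Y^T r.
    // y holds one k=2 factor row per rated item; r holds the corresponding ratings.
    static double[] foldInUser(double[][] y, double[] r, double lambda) {
        final int k = 2;
        double[][] a = new double[k][k];  // accumulates Y^T Y
        double[] b = new double[k];       // accumulates Y^T r
        for (int i = 0; i < y.length; i++) {
            for (int p = 0; p < k; p++) {
                b[p] += y[i][p] * r[i];
                for (int q = 0; q < k; q++) {
                    a[p][q] += y[i][p] * y[i][q];
                }
            }
        }
        for (int p = 0; p < k; p++) a[p][p] += lambda;  // ridge term
        // Cramer's rule for the 2x2 normal equations.
        double det = a[0][0] * a[1][1] - a[0][1] * a[1][0];
        return new double[] {
            (b[0] * a[1][1] - b[1] * a[0][1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det
        };
    }

    public static void main(String[] args) {
        // Two rated items with orthonormal factors, so the answer is ratings / (1 + lambda).
        double[][] itemFactors = {{1.0, 0.0}, {0.0, 1.0}};
        double[] ratings = {4.0, 2.0};
        double[] u = foldInUser(itemFactors, ratings, 0.1);
        System.out.println(u[0] + " " + u[1]);
    }
}
```

This is the approximation the comments above weigh against a full retrain: it is cheap and keeps existing factors fixed, and the question raised is how long it takes before the approximation error actually changes item rankings.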
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896186#comment-15896186 ] Sean Owen commented on SPARK-19656: --- I guess I mean, have you really tried it? It doesn't result in a compile error, and you didn't say what the compile error is. This works, yes:
{code}
public static class Foo {}
...
Dataset<Row> ds = spark.read().load("/path/to/file.avro");
ds.map((MapFunction<Row, Foo>) row -> (Foo) row.get(0), new MyFooEncoder());
...
{code}
Meaning, the cast in question works, and you can map to a new Dataset if you have an encoder for your type. The rest of the example you provide above doesn't work; it looks like a Hadoop API version problem. That's up to your code, though. You're trying to use old Hadoop API Avro classes with newAPIHadoopFile. This should be on the mailing list until it's narrowed down.
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896179#comment-15896179 ] Nira Amit commented on SPARK-19656: --- Yes, I did, and answered him that it gives a compilation error in Java. Have you tried doing this in Java? If this is possible then please give a working code example, there should be no discussion if there is a correct way of doing this.
[jira] [Comment Edited] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896177#comment-15896177 ] Daniel Li edited comment on SPARK-6407 at 3/5/17 10:57 AM: --- {quote} In practice fold-in works fine. Folding in a day or so of updates has been OK. The question isn't RMSE but how it affects actual rankings of items in recommendations, and it takes a while before the effect of the approximation actually changes a rank. {quote} Hmm, I see. This would be something I'd be interested in implementing for Spark if there's need. Are there implementations (or papers) of this you know of that I could look at?
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896177#comment-15896177 ] Daniel Li commented on SPARK-6407: -- bq. In practice fold-in works fine. Folding in a day or so of updates has been OK. The question isn't RMSE but how it affects actual rankings of items in recommendations, and it takes a while before the effect of the approximation actually changes a rank. Hmm, I see. This would be something I'd be interested in implementing for Spark if there's need. Are there implementations (or papers) of this you know of that I could look at?
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896174#comment-15896174 ] Sean Owen commented on SPARK-19656: --- Have you tried Eric's suggestion? asInstanceOf is just casting in Java. That is the kind of discussion to have on the mailing list and homework to do before opening a JIRA.
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896172#comment-15896172 ] Nira Amit commented on SPARK-19656: --- But if this is not possible to do in Java then it IS an actionable change, isn't it? I already posted this question several weeks ago in StackOverflow and got many upvotes but no answer, which is why I posted it in the "Question" category of your Jira. Is it possible in Java or isn't it? From Eric's answer it sounds like it should be, yet nobody seems to know how to do it.
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896169#comment-15896169 ] Sean Owen commented on SPARK-19656: --- Mostly, it is that questions should go to the mailing list. I don't keep track of your JIRAs. This should be reserved for actionable changes, not questions.
[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896166#comment-15896166 ] Nira Amit commented on SPARK-19656: --- [~emaynard] There is no "asInstanceOf" method in the Java API. And if I try to cast it directly I get a compilation error. [~sowen] Are you not handling tickets about the Java API? It's the second time you close a ticket I open about loading custom objects from Avro in Java and mark it as "Not a problem". Either this is not possible in Java, in which case it's at least a missing feature (and misleading, because it looks like it should be possible), or I'm not doing it right and in this case you can provide a working code example in Java.
[jira] [Resolved] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19656. --- Resolution: Not A Problem > Can't load custom type from avro file to RDD with newAPIHadoopFile > -- > > Key: SPARK-19656 > URL: https://issues.apache.org/jira/browse/SPARK-19656 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.2 >Reporter: Nira Amit > > If I understand correctly, in scala it's possible to load custom objects from > avro files to RDDs this way: > {code} > ctx.hadoopFile("/path/to/the/avro/file.avro", > classOf[AvroInputFormat[MyClassInAvroFile]], > classOf[AvroWrapper[MyClassInAvroFile]], > classOf[NullWritable]) > {code} > I'm not a scala developer, so I tried to "translate" this to java as best I > could. I created classes that extend AvroKey and FileInputFormat: > {code} > public static class MyCustomAvroKey extends AvroKey{}; > public static class MyCustomAvroReader extends > AvroRecordReaderBase{ > // with my custom schema and all the required methods... > } > public static class MyCustomInputFormat extends > FileInputFormat { > @Override > public RecordReader > createRecordReader(InputSplit inputSplit, TaskAttemptContext > taskAttemptContext) throws IOException, InterruptedException { > return new MyCustomAvroReader(); > } > } > ... > JavaPairRDD records = > sc.newAPIHadoopFile("file:/path/to/datafile.avro", > MyCustomInputFormat.class, MyCustomAvroKey.class, > NullWritable.class, > sc.hadoopConfiguration()); > MyCustomClass first = records.first()._1.datum(); > System.out.println("Got a result, some custom field: " + > first.getSomeCustomField()); > {code} > This compiles fine, but using a debugger I can see that `first._1.datum()` > actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` > instance. 
> And indeed, when the following line executes: > {code} > MyCustomClass first = records.first()._1.datum(); > {code} > I get an exception: > {code} > java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record > cannot be cast to my.package.containing.MyCustomClass > {code} > Am I doing it wrong? Or is this not possible in Java?
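Why does the code in the report compile yet fail only at the assignment? One way to see it is Java generics erasure: `AvroKey<T>` holds its datum behind an erased type parameter, so no cast is checked until the call site that assigns the result to the specific type. The following is a hedged, self-contained sketch; `Key` and `MyCustomClass` here are invented stand-ins, not the real Avro or Spark classes:

```java
import java.util.HashMap;

// Stand-in for AvroKey<T>: the datum is stored as a plain Object, so the
// compiler cannot verify what the record reader actually produced.
class Key<T> {
    private Object datum;
    void setDatum(Object d) { datum = d; }
    @SuppressWarnings("unchecked")
    T datum() { return (T) datum; }   // unchecked cast: nothing checked here
}

class MyCustomClass { }

public class ErasureDemo {
    // Simulates a reader that filled in a generic record (here a HashMap)
    // instead of the specific class the caller expected.
    static Key<MyCustomClass> keyHoldingGenericRecord() {
        Key<MyCustomClass> key = new Key<>();
        key.setDatum(new HashMap<String, Object>());
        return key;
    }

    public static void main(String[] args) {
        Key<MyCustomClass> key = keyHoldingGenericRecord();

        Object asObject = key.datum();   // no checkcast needed: succeeds
        System.out.println(asObject.getClass().getName());

        try {
            // javac inserts a checkcast to MyCustomClass at this assignment,
            // which is where the ClassCastException finally surfaces, exactly
            // as in the reported stack trace.
            MyCustomClass first = key.datum();
            System.out.println(first);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException at the assignment site");
        }
    }
}
```

This mirrors why the debugger shows a `GenericData$Record` while the code still compiles: the cast only happens where the erased value is assigned to the specific type.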
[jira] [Assigned] (SPARK-19792) In the Master Page,the column named “Memory per Node” ,I think it is not all right
[ https://issues.apache.org/jira/browse/SPARK-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19792: - Assignee: liuxian > In the Master Page,the column named “Memory per Node” ,I think it is not all > right > --- > > Key: SPARK-19792 > URL: https://issues.apache.org/jira/browse/SPARK-19792 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: liuxian >Assignee: liuxian >Priority: Trivial > Fix For: 2.2.0 > > > On the Spark web UI, the Master page has two tables, Running Applications and > Completed Applications. Both have a column named “Memory per Node”, which I > think is not quite right, because a node may have more than one executor. The > column should be renamed “Memory per Executor”; otherwise it is easy for > users to misunderstand.
[jira] [Resolved] (SPARK-19792) In the Master Page,the column named “Memory per Node” ,I think it is not all right
[ https://issues.apache.org/jira/browse/SPARK-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19792. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17132 [https://github.com/apache/spark/pull/17132] > In the Master Page,the column named “Memory per Node” ,I think it is not all > right > --- > > Key: SPARK-19792 > URL: https://issues.apache.org/jira/browse/SPARK-19792 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: liuxian >Priority: Trivial > Fix For: 2.2.0 > > > On the Spark web UI, the Master page has two tables, Running Applications and > Completed Applications. Both have a column named “Memory per Node”, which I > think is not quite right, because a node may have more than one executor. The > column should be renamed “Memory per Executor”; otherwise it is easy for > users to misunderstand.
[jira] [Resolved] (SPARK-19713) saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19713. --- Resolution: Not A Problem > saveAsTable > --- > > Key: SPARK-19713 > URL: https://issues.apache.org/jira/browse/SPARK-19713 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Balaram R Gadiraju > > Hi, > I just observed that dataframe.saveAsTable("table") (in old versions) and > dataframe.write.saveAsTable("table") (in newer versions), e.g. > “df3.saveAsTable("brokentable")” in Scala code, first creates a folder in HDFS > and only afterwards updates the Hive metastore to record the new table. So if > anything goes wrong in between, the folder still exists but Hive is not aware > of it. This blocks users from creating the table “brokentable”, as the folder > already exists; the folder can be removed with “hadoop fs -rmr > /data/hive/databases/testdb.db/brokentable”. Below is a workaround that will > enable you to continue the development work. > Current Code: > val df3 = sqlContext.sql("select * from testtable") > df3.saveAsTable("brokentable") > THE WORKAROUND: > Registering the DataFrame as a temporary table and then using a SQL command to > load the data resolves the issue. EX: > sqlContext.sql("select * from testtable").registerTempTable("df3") > sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")
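The failure mode the report describes (the table folder appears on disk before the metastore learns about the table, so a mid-write failure leaves an orphan directory that blocks retries) can be modeled with a small toy. This is a hedged sketch only: `saveAsTable`, the in-memory `metastore` set, and the local filesystem below are stand-ins, not Spark's or Hive's actual implementation:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Toy model of the reported sequence: step 1 creates the table directory,
// and the "metastore" only learns about the table in step 2. A failure in
// between leaves a directory the metastore knows nothing about.
public class OrphanTableDemo {
    final Set<String> metastore = new HashSet<>();

    void saveAsTable(Path warehouse, String table, boolean failMidWrite) throws IOException {
        Path dir = warehouse.resolve(table);
        Files.createDirectory(dir);              // step 1: folder exists on disk
        if (failMidWrite) {
            throw new IOException("job failed after creating the folder");
        }
        metastore.add(table);                    // step 2: metastore updated last
    }

    public static void main(String[] args) throws IOException {
        OrphanTableDemo demo = new OrphanTableDemo();
        Path warehouse = Files.createTempDirectory("warehouse");

        try {
            demo.saveAsTable(warehouse, "brokentable", true);
        } catch (IOException e) {
            System.out.println("first attempt failed: " + e.getMessage());
        }

        // The metastore never heard of the table, yet the folder is there...
        System.out.println("in metastore: " + demo.metastore.contains("brokentable"));
        System.out.println("folder exists: " + Files.exists(warehouse.resolve("brokentable")));

        // ...so a retry is blocked until the orphan folder is removed
        // (the report's "hadoop fs -rmr" step, modeled with a local delete).
        try {
            demo.saveAsTable(warehouse, "brokentable", false);
        } catch (FileAlreadyExistsException e) {
            Files.delete(warehouse.resolve("brokentable"));
            demo.saveAsTable(warehouse, "brokentable", false);
        }
        System.out.println("after cleanup, in metastore: " + demo.metastore.contains("brokentable"));
    }
}
```

The workaround in the report sidesteps the orphan folder by issuing a `CREATE TABLE ... AS SELECT`, which registers the table through the metastore path instead.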
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896154#comment-15896154 ] Sean Owen commented on SPARK-6407: -- Computing one or two iterations per update -- as in every time someone clicks on a product or something? no, that's way way too slow. Each would launch tens of large distributed jobs. In practice fold-in works fine. Folding in a day or so of updates has been OK. The question isn't RMSE but how it affects actual rankings of items in recommendations, and it takes a while before the effect of the approximation actually changes a rank. > Streaming ALS for Collaborative Filtering > - > > Key: SPARK-6407 > URL: https://issues.apache.org/jira/browse/SPARK-6407 > Project: Spark > Issue Type: New Feature > Components: DStreams >Reporter: Felix Cheung >Priority: Minor > > Like MLLib's ALS implementation for recommendation, and applying to streaming. > Similar to streaming linear regression, logistic regression, could we apply > gradient updates to batches of data and reuse existing MLLib implementation?
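The "fold-in" mentioned in the comment above can be made concrete: holding the learned item factors fixed, a new or updated user vector is a small ridge-regression solve, x = (Ys^T Ys + lambda*I)^-1 Ys^T r, over only the items that user touched, so no distributed job is needed. Below is a hedged sketch with rank-2 factors and made-up numbers, not Spark's ALS code:

```java
// Fold-in sketch: given fixed item factors, solve the regularized
// least-squares problem for one user's vector. Rank is fixed at 2 so the
// normal equations can be solved with an explicit 2x2 inverse.
public class FoldInDemo {

    // Solve (Ys^T Ys + lambda*I) x = Ys^T r for x, where each row of ys is
    // the factor vector of one item the user interacted with.
    static double[] foldInUser(double[][] ys, double[] r, double lambda) {
        // Accumulate the symmetric 2x2 normal matrix A and the vector b.
        double a00 = lambda, a01 = 0, a11 = lambda;
        double b0 = 0, b1 = 0;
        for (int i = 0; i < ys.length; i++) {
            a00 += ys[i][0] * ys[i][0];
            a01 += ys[i][0] * ys[i][1];
            a11 += ys[i][1] * ys[i][1];
            b0 += ys[i][0] * r[i];
            b1 += ys[i][1] * r[i];
        }
        // Explicit 2x2 solve via the adjugate; A is symmetric positive
        // definite for lambda > 0, so det > 0.
        double det = a00 * a11 - a01 * a01;
        return new double[] {
            (a11 * b0 - a01 * b1) / det,
            (a00 * b1 - a01 * b0) / det
        };
    }

    static double predict(double[] x, double[] y) {
        return x[0] * y[0] + x[1] * y[1];
    }

    public static void main(String[] args) {
        // Item factors learned by a previous batch ALS run (made up).
        double[][] itemFactors = { {1, 0}, {0, 1}, {1, 1} };
        // The new user rated item 0 with 2.0 and item 2 with 3.0.
        double[][] rated = { itemFactors[0], itemFactors[2] };
        double[] ratings = { 2.0, 3.0 };

        double[] x = foldInUser(rated, ratings, 0.1);
        System.out.printf("user vector: (%.4f, %.4f)%n", x[0], x[1]);
        System.out.printf("score for unrated item 1: %.4f%n", predict(x, itemFactors[1]));
    }
}
```

This is why fold-in is cheap enough to run per event while full iterations are not: the solve touches one user's handful of ratings and a k-by-k matrix, versus a cluster-wide sweep over all factors.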
[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution
[ https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-19737: --- Description: Let's consider the following simple SQL query that references an undefined function {{foo}} that is never registered in the function registry: {code:sql} SELECT foo(a) FROM t {code} Assuming table {{t}} is a partitioned temporary view consisting of a large number of files stored on S3, then it may take the analyzer a long time before realizing that {{foo}} is not registered yet. The reason is that the existing analysis rule {{ResolveFunctions}} requires all child expressions to be resolved first. Therefore, {{ResolveRelations}} has to be executed first to resolve all columns referenced by the unresolved function invocation. This further leads to partition discovery for {{t}}, which may take a long time. To address this case, we propose a new lightweight analysis rule {{LookupFunctions}} that # Matches all unresolved function invocations # Looks up the function names in the function registry # Reports an analysis error for any unregistered functions Since this rule doesn't actually try to resolve the unresolved functions, it doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition discovery. We may put this analysis rule in a separate {{Once}} rule batch that sits between the "Substitution" batch and the "Resolution" batch to avoid running it repeatedly and make sure it gets executed before {{ResolveRelations}}. was: Let's consider the following simple SQL query that references an invalid function {{foo}} that is never registered in the function registry: {code:sql} SELECT foo(a) FROM t {code} Assuming table {{t}} is a partitioned temporary view consisting of a large number of files stored on S3, then it may take the analyzer a long time before realizing that {{foo}} is not registered yet.
The reason is that the existing analysis rule {{ResolveFunctions}} requires all child expressions to be resolved first. Therefore, {{ResolveRelations}} has to be executed first to resolve all columns referenced by the unresolved function invocation. This further leads to partition discovery for {{t}}, which may take a long time. To address this case, we propose a new lightweight analysis rule {{LookupFunctions}} that # Matches all unresolved function invocations # Looks up the function names in the function registry # Reports an analysis error for any unregistered functions Since this rule doesn't actually try to resolve the unresolved functions, it doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition discovery. We may put this analysis rule in a separate {{Once}} rule batch that sits between the "Substitution" batch and the "Resolution" batch to avoid running it repeatedly and make sure it gets executed before {{ResolveRelations}}. > New analysis rule for reporting unregistered functions without relying on > relation resolution > - > > Key: SPARK-19737 > URL: https://issues.apache.org/jira/browse/SPARK-19737 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian > Fix For: 2.2.0 > > > Let's consider the following simple SQL query that references an undefined > function {{foo}} that is never registered in the function registry: > {code:sql} > SELECT foo(a) FROM t > {code} > Assuming table {{t}} is a partitioned temporary view consisting of a large > number of files stored on S3, then it may take the analyzer a long time > before realizing that {{foo}} is not registered yet. > The reason is that the existing analysis rule {{ResolveFunctions}} requires > all child expressions to be resolved first. Therefore, {{ResolveRelations}} > has to be executed first to resolve all columns referenced by the unresolved > function invocation.
This further leads to partition discovery for {{t}}, > which may take a long time. > To address this case, we propose a new lightweight analysis rule > {{LookupFunctions}} that > # Matches all unresolved function invocations > # Looks up the function names in the function registry > # Reports an analysis error for any unregistered functions > Since this rule doesn't actually try to resolve the unresolved functions, it > doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition > discovery. > We may put this analysis rule in a separate {{Once}} rule batch that sits > between the "Substitution" batch and the "Resolution" batch to avoid running > it repeatedly and make sure it gets executed before {{ResolveRelations}}.
[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution
[ https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-19737: --- Description: Let's consider the following simple SQL query that references an undefined function {{foo}} that is never registered in the function registry: {code:sql} SELECT foo(a) FROM t {code} Assuming table {{t}} is a partitioned temporary view consisting of a large number of files stored on S3, it may take the analyzer a long time before realizing that {{foo}} is not registered yet. The reason is that the existing analysis rule {{ResolveFunctions}} requires all child expressions to be resolved first. Therefore, {{ResolveRelations}} has to be executed first to resolve all columns referenced by the unresolved function invocation. This further leads to partition discovery for {{t}}, which may take a long time. To address this case, we propose a new lightweight analysis rule {{LookupFunctions}} that # Matches all unresolved function invocations # Looks up the function names in the function registry # Reports an analysis error for any unregistered functions Since this rule doesn't actually try to resolve the unresolved functions, it doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition discovery. We may put this analysis rule in a separate {{Once}} rule batch that sits between the "Substitution" batch and the "Resolution" batch to avoid running it repeatedly and make sure it gets executed before {{ResolveRelations}}. was: Let's consider the following simple SQL query that references an undefined function {{foo}} that is never registered in the function registry: {code:sql} SELECT foo(a) FROM t {code} Assuming table {{t}} is a partitioned temporary view consisting of a large number of files stored on S3, then it may take the analyzer a long time before realizing that {{foo}} is not registered yet.
The reason is that the existing analysis rule {{ResolveFunctions}} requires all child expressions to be resolved first. Therefore, {{ResolveRelations}} has to be executed first to resolve all columns referenced by the unresolved function invocation. This further leads to partition discovery for {{t}}, which may take a long time. To address this case, we propose a new lightweight analysis rule {{LookupFunctions}} that # Matches all unresolved function invocations # Looks up the function names in the function registry # Reports an analysis error for any unregistered functions Since this rule doesn't actually try to resolve the unresolved functions, it doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition discovery. We may put this analysis rule in a separate {{Once}} rule batch that sits between the "Substitution" batch and the "Resolution" batch to avoid running it repeatedly and make sure it gets executed before {{ResolveRelations}}. > New analysis rule for reporting unregistered functions without relying on > relation resolution > - > > Key: SPARK-19737 > URL: https://issues.apache.org/jira/browse/SPARK-19737 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian > Fix For: 2.2.0 > > > Let's consider the following simple SQL query that references an undefined > function {{foo}} that is never registered in the function registry: > {code:sql} > SELECT foo(a) FROM t > {code} > Assuming table {{t}} is a partitioned temporary view consisting of a large > number of files stored on S3, it may take the analyzer a long time before > realizing that {{foo}} is not registered yet. > The reason is that the existing analysis rule {{ResolveFunctions}} requires > all child expressions to be resolved first. Therefore, {{ResolveRelations}} > has to be executed first to resolve all columns referenced by the unresolved > function invocation.
This further leads to partition discovery for {{t}}, > which may take a long time. > To address this case, we propose a new lightweight analysis rule > {{LookupFunctions}} that > # Matches all unresolved function invocations > # Looks up the function names in the function registry > # Reports an analysis error for any unregistered functions > Since this rule doesn't actually try to resolve the unresolved functions, it > doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition > discovery. > We may put this analysis rule in a separate {{Once}} rule batch that sits > between the "Substitution" batch and the "Resolution" batch to avoid running > it repeatedly and make sure it gets executed before {{ResolveRelations}}.
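The proposed {{LookupFunctions}} rule can be illustrated with a toy pass. This is a hedged sketch; `Expr`, `Func`, `Column`, and the registry below are invented stand-ins, not Spark's analyzer classes. The point it demonstrates is the one in the description: the pass checks only function names against the registry and deliberately ignores column references, so relation resolution and partition discovery are never triggered:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy LookupFunctions-style pass: report unregistered function names from
// the expression tree alone, before any relation or column resolution.
public class LookupFunctionsDemo {

    static abstract class Expr { }

    static class Column extends Expr {
        final String name;
        Column(String name) { this.name = name; }
    }

    static class Func extends Expr {
        final String name;
        final List<Expr> args;
        Func(String name, Expr... args) {
            this.name = name;
            this.args = Arrays.asList(args);
        }
    }

    static List<String> unknownFunctions(Expr e, Set<String> registry) {
        List<String> missing = new ArrayList<>();
        collect(e, registry, missing);
        return missing;
    }

    private static void collect(Expr e, Set<String> registry, List<String> out) {
        if (e instanceof Func) {
            Func f = (Func) e;
            if (!registry.contains(f.name)) {
                out.add(f.name);
            }
            for (Expr a : f.args) {
                collect(a, registry, out);
            }
        }
        // Column references are deliberately ignored: this pass never
        // resolves them, so it cannot trigger relation resolution or
        // partition discovery.
    }

    public static void main(String[] args) {
        Set<String> registry = new HashSet<>(Arrays.asList("upper", "count"));
        // SELECT foo(upper(a)) FROM t  -- "foo" was never registered.
        Expr query = new Func("foo", new Func("upper", new Column("a")));
        List<String> missing = unknownFunctions(query, registry);
        if (!missing.isEmpty()) {
            System.out.println("Undefined function(s): " + missing);
        }
    }
}
```

Because the walk touches only the in-memory tree and a name set, running it once before the Resolution batch reports the error immediately, regardless of how many files back table {{t}}.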