[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965411#comment-15965411 ] Saisai Shao commented on SPARK-20286:

bq. Maybe the best approach is to change from cachedExecutorIdleTimeout to executorIdleTimeout on all cached executors when the last RDD has been unpersisted, and then restart the time counter (unpersist will then count as an action).

Yes, I think this is a feasible solution. I can help out if you're not familiar with that code.

> dynamicAllocation.executorIdleTimeout is ignored after unpersist
>
> Key: SPARK-20286
> URL: https://issues.apache.org/jira/browse/SPARK-20286
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.1
> Reporter: Miguel Pérez
>
> With dynamic allocation enabled, it seems that executors with cached data which are unpersisted are still being killed using the {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor with unpersisted data won't be released until the job ends.
>
> *How to reproduce*
> - Set different values for {{dynamicAllocation.executorIdleTimeout}} and {{dynamicAllocation.cachedExecutorIdleTimeout}}
> - Load a file into an RDD and persist it
> - Execute an action on the RDD (like a count) so some executors are activated.
> - When the action has finished, unpersist the RDD
> - The application UI correctly removes the persisted data from the *Storage* tab, but if you look in the *Executors* tab, you will find that the executors remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is reached.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965408#comment-15965408 ] Miguel Pérez commented on SPARK-20286:

Yes. I suppose this is only checked when an executor passes from active to idle status (maybe when the last action has finished), and it's not changed when unpersist is called. But this is problematic, for example, if the job is interactive. Imagine you have a Zeppelin notebook or a Spark shell. The same driver could have different purposes and the session could stay open a long time. Once you have called a single persist, all these executors will have to wait {{dynamicAllocation.cachedExecutorIdleTimeout}} (which is Infinity by default) or wait for the user to execute another action (executors will become active again and then idle, but this time without cached data). I don't know the solution. Maybe the best approach is to change from cachedExecutorIdleTimeout to executorIdleTimeout on all cached executors when the last RDD has been unpersisted, and then restart the time counter (unpersist will then count as an action).
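The timeout selection discussed in this thread can be sketched in a few lines of plain Scala. This is a simplified model, not Spark's actual ExecutorAllocationManager code: the point is that an executor's removal deadline must be re-derived when its cached blocks disappear, not only when the executor first turns idle.

```scala
// Toy model (assumption: simplified; not Spark's ExecutorAllocationManager).
// Times are in seconds.
case class ExecutorState(hasCachedBlocks: Boolean, idleSince: Long)

val executorIdleTimeout = 60L                  // dynamicAllocation.executorIdleTimeout
val cachedExecutorIdleTimeout = Long.MaxValue  // default: Infinity

// The deadline after which an idle executor may be removed.
def removalDeadline(e: ExecutorState): Long = {
  val timeout = if (e.hasCachedBlocks) cachedExecutorIdleTimeout else executorIdleTimeout
  if (timeout == Long.MaxValue) Long.MaxValue else e.idleSince + timeout
}

// An executor holding cached data is never reclaimed under the defaults...
val cached = ExecutorState(hasCachedBlocks = true, idleSince = 100L)
// ...but re-evaluating after unpersist (cached blocks gone, counter restarted)
// yields a finite deadline, which is the behaviour the reporters expect.
val unpersisted = ExecutorState(hasCachedBlocks = false, idleSince = 100L)
```

The bug report amounts to the second re-evaluation never happening: the deadline is computed once at the active-to-idle transition and not refreshed on unpersist.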
[jira] [Commented] (SPARK-20303) Rename createTempFunction to registerFunction
[ https://issues.apache.org/jira/browse/SPARK-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965405#comment-15965405 ] Apache Spark commented on SPARK-20303:

User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17615

> Rename createTempFunction to registerFunction
>
> Key: SPARK-20303
> URL: https://issues.apache.org/jira/browse/SPARK-20303
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Xiao Li
> Assignee: Xiao Li
>
> Session catalog API `createTempFunction` is being used by Hive built-in functions, persistent functions, and temporary functions. Thus, the name is confusing. This PR is to replace it by `registerFunction`. Also we can move construction of `FunctionBuilder` and `ExpressionInfo` into the new `registerFunction`, instead of duplicating the logic everywhere.
[jira] [Assigned] (SPARK-20303) Rename createTempFunction to registerFunction
[ https://issues.apache.org/jira/browse/SPARK-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20303: Assignee: Xiao Li (was: Apache Spark)
[jira] [Assigned] (SPARK-20303) Rename createTempFunction to registerFunction
[ https://issues.apache.org/jira/browse/SPARK-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20303: Assignee: Apache Spark (was: Xiao Li)
[jira] [Updated] (SPARK-20303) Rename createTempFunction to registerFunction
[ https://issues.apache.org/jira/browse/SPARK-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20303: Summary: Rename createTempFunction to registerFunction (was: Replace createTempFunction by registerFunction)
[jira] [Created] (SPARK-20303) Replace createTempFunction by registerFunction
Xiao Li created SPARK-20303: --- Summary: Replace createTempFunction by registerFunction Key: SPARK-20303 URL: https://issues.apache.org/jira/browse/SPARK-20303 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Xiao Li Assignee: Xiao Li
[jira] [Commented] (SPARK-20302) Short circuit cast when from and to types are structurally the same
[ https://issues.apache.org/jira/browse/SPARK-20302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965354#comment-15965354 ] Apache Spark commented on SPARK-20302:

User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17614

> Short circuit cast when from and to types are structurally the same
>
> Key: SPARK-20302
> URL: https://issues.apache.org/jira/browse/SPARK-20302
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> When we perform a cast expression and the from and to types are structurally the same (having the same structure but different field names), we should be able to skip the actual cast.
[jira] [Assigned] (SPARK-20302) Short circuit cast when from and to types are structurally the same
[ https://issues.apache.org/jira/browse/SPARK-20302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20302: Assignee: Apache Spark (was: Reynold Xin)
[jira] [Assigned] (SPARK-20302) Short circuit cast when from and to types are structurally the same
[ https://issues.apache.org/jira/browse/SPARK-20302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20302: Assignee: Reynold Xin (was: Apache Spark)
[jira] [Updated] (SPARK-20302) Short circuit cast when from and to types are structurally the same
[ https://issues.apache.org/jira/browse/SPARK-20302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-20302: Summary: Short circuit cast when from and to types are structurally the same (was: Optimize cast when from and to types are structurally the same)
[jira] [Created] (SPARK-20302) Optimize cast when from and to types are structurally the same
Reynold Xin created SPARK-20302: --- Summary: Optimize cast when from and to types are structurally the same Key: SPARK-20302 URL: https://issues.apache.org/jira/browse/SPARK-20302 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin
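The "structurally the same" condition can be illustrated with a small type model in plain Scala. This is a sketch with hypothetical type names, not Spark's `org.apache.spark.sql.types` hierarchy: two struct types match when their field types line up positionally, regardless of field names.

```scala
// Minimal type model (hypothetical names, not Spark's DataType classes).
sealed trait DType
case object IntT extends DType
case object StringT extends DType
case class StructT(fields: List[(String, DType)]) extends DType

// Structural equality: field names are ignored; field types must match
// positionally, recursing into nested structs.
def sameStructure(a: DType, b: DType): Boolean = (a, b) match {
  case (StructT(fa), StructT(fb)) =>
    fa.length == fb.length &&
      fa.zip(fb).forall { case ((_, ta), (_, tb)) => sameStructure(ta, tb) }
  case _ => a == b
}

// struct<a:int,b:string> vs struct<x:int,y:string>: per the JIRA, a cast
// between these two types could be skipped entirely.
val from = StructT(List("a" -> IntT, "b" -> StringT))
val to   = StructT(List("x" -> IntT, "y" -> StringT))
```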
[jira] [Resolved] (SPARK-19993) Caching logical plans containing subquery expressions does not work.
[ https://issues.apache.org/jira/browse/SPARK-19993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19993. Resolution: Fixed. Fix Version/s: 2.2.0. Issue resolved by pull request 17330 [https://github.com/apache/spark/pull/17330]

> Caching logical plans containing subquery expressions does not work.
>
> Key: SPARK-19993
> URL: https://issues.apache.org/jira/browse/SPARK-19993
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Dilip Biswal
> Fix For: 2.2.0
>
> Here is a simple repro that depicts the problem. In this case the second invocation of the sql should have been served from the cache. However, the lookup currently fails.
>
> {code}
> scala> val ds = spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)")
> ds: org.apache.spark.sql.DataFrame = [c1: int]
> scala> ds.cache
> res13: ds.type = [c1: int]
> scala> spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)").explain(true)
> == Analyzed Logical Plan ==
> c1: int
> Project [c1#86]
> +- Filter c1#86 IN (list#78 [c1#86])
>    :  +- Project [c1#87]
>    :     +- Filter (outer(c1#86) = c1#87)
>    :        +- SubqueryAlias s2
>    :           +- Relation[c1#87] parquet
>    +- SubqueryAlias s1
>       +- Relation[c1#86] parquet
> == Optimized Logical Plan ==
> Join LeftSemi, ((c1#86 = c1#87) && (c1#86 = c1#87))
> :- Relation[c1#86] parquet
> +- Relation[c1#87] parquet
> {code}
[jira] [Assigned] (SPARK-19993) Caching logical plans containing subquery expressions does not work.
[ https://issues.apache.org/jira/browse/SPARK-19993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19993: --- Assignee: Dilip Biswal
[jira] [Resolved] (SPARK-20291) NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, DoubleType)
[ https://issues.apache.org/jira/browse/SPARK-20291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20291. Resolution: Fixed. Fix Version/s: 2.2.0. Issue resolved by pull request 17606 [https://github.com/apache/spark/pull/17606]

> NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, DoubleType)
>
> Key: SPARK-20291
> URL: https://issues.apache.org/jira/browse/SPARK-20291
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Reporter: DB Tsai
> Assignee: DB Tsai
> Fix For: 2.2.0
>
> `NaNvl(float value, null)` will be converted into `NaNvl(float value, Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), Cast(null, DoubleType))`. This causes a mismatch in the output type when the input type is float. Adding an extra rule in TypeCoercion resolves this issue.
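The extra coercion rule described above can be sketched as a tiny type-resolution function in plain Scala. The names here are hypothetical simplifications, not Spark's actual TypeCoercion rules: when NaNvl's first argument is a float and the second is a null literal, cast the null to FloatType instead of widening both sides to DoubleType.

```scala
// Toy type model (assumption: simplified; not Spark's TypeCoercion code).
sealed trait T
case object FloatT extends T
case object DoubleT extends T
case object NullT extends T

// Resolve the argument types of NaNvl(left, right). The special case keeps
// FloatType on the left and casts the null literal to FloatType, avoiding the
// unwanted promotion to (DoubleType, DoubleType) described in the JIRA.
def nanvlTypes(left: T, right: T): (T, T) = (left, right) match {
  case (FloatT, NullT)  => (FloatT, FloatT)   // keep float; cast null to float
  case (DoubleT, NullT) => (DoubleT, DoubleT) // double input is unchanged
  case _                => (left, right)
}
```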
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965308#comment-15965308 ] Fei Wang commented on SPARK-20184:

Tested with a smaller table of 100,000 rows. Codegen on: 2.6s. Codegen off: 1.5s.

> performance regression for complex/long sql when enable whole stage codegen
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.0, 2.1.0
> Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x than with codegen off.
>
> {code}
> SELECT
>   sum(COUNTER_57)
>   ,sum(COUNTER_71)
>   ,sum(COUNTER_3)
>   ,sum(COUNTER_70)
>   ,sum(COUNTER_66)
>   ,sum(COUNTER_75)
>   ,sum(COUNTER_69)
>   ,sum(COUNTER_55)
>   ,sum(COUNTER_63)
>   ,sum(COUNTER_68)
>   ,sum(COUNTER_56)
>   ,sum(COUNTER_37)
>   ,sum(COUNTER_51)
>   ,sum(COUNTER_42)
>   ,sum(COUNTER_43)
>   ,sum(COUNTER_1)
>   ,sum(COUNTER_76)
>   ,sum(COUNTER_54)
>   ,sum(COUNTER_44)
>   ,sum(COUNTER_46)
>   ,DIM_1
>   ,DIM_2
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> {code}
>
> The number of rows in aggtable is about 3500.
> Whole-stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> Whole-stage codegen off (spark.sql.codegen.wholeStage = false): 6s
>
> After some analysis I think this is related to the huge Java method (a method of thousands of lines) generated by codegen. If I configure -XX:-DontCompileHugeMethods, performance gets much better (about 7s).
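The two knobs mentioned in this thread can be exercised together from the submit command. The submit line below is illustrative only (application jar and cluster flags are placeholders); the relevant point is that by default HotSpot refuses to JIT-compile methods over the huge-method bytecode limit, and `-XX:-DontCompileHugeMethods` lifts that restriction for the large generated method.

```shell
# Illustrative experiment for the regression above (jar/cluster flags are placeholders):
#  run 1: defaults                        -> slow (huge generated method stays interpreted)
#  run 2: spark.sql.codegen.wholeStage=false -> fast (no huge method generated)
#  run 3: as below                        -> fast-ish (JIT compiles the huge method)
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:-DontCompileHugeMethods" \
  --conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods" \
  --conf "spark.sql.codegen.wholeStage=true" \
  your-app.jar
```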
[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965306#comment-15965306 ] Saisai Shao commented on SPARK-20286:

So the fix is: if no RDD remains persisted, the executors' idle timeout should be changed back to {{dynamicAllocation.executorIdleTimeout}}.
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965302#comment-15965302 ] Wenchen Fan commented on SPARK-19659:

It seems your model is: all shuffle blocks should be fetched to disk, but we have a buffer to keep some of the shuffle blocks in memory, and the buffer size is controlled by {{spark.reducer.maxBytesShuffleToMemory}}. Overall I think this is a good model, but instead of using {{spark.reducer.maxBytesShuffleToMemory}}, can we leverage the memory manager and allocate as much memory as possible for this buffer?

> Fetch big blocks to disk when shuffle-read
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.1.0
> Reporter: jin xing
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
> Currently the whole block is fetched into memory (off-heap by default) when shuffle-read. A block is identified by (shuffleId, mapId, reduceId), so it can be large in skew situations. If OOM happens during shuffle read, the job will be killed and users will be told to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more memory can resolve the OOM, but that approach is not well suited to production environments, especially data warehouses. Using Spark SQL as the data engine in a warehouse, users want a unified parameter (e.g. memory) with less resource wasted (allocated but not used). It's not always easy to predict skew situations; when they happen, it makes sense to fetch remote blocks to disk for shuffle-read rather than kill the job because of OOM. This approach was mentioned during the discussion in SPARK-3019, by [~sandyr] and [~mridulm80].
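The buffering policy under discussion can be sketched as a toy placement function in plain Scala. This is a simplified model of the proposal, not the actual shuffle-read change (and per Wenchen's comment, the real design would consult the memory manager rather than a fixed budget):

```scala
// Toy model (assumption: simplified; not Spark's ShuffleBlockFetcherIterator):
// fetch blocks into memory until a byte budget -- the proposed
// spark.reducer.maxBytesShuffleToMemory -- is exhausted, then send the rest to
// disk instead of risking OOM on a skewed block.
def placeBlocks(blockSizes: Seq[Long], memoryBudget: Long): Seq[(Long, String)] = {
  var used = 0L
  blockSizes.map { size =>
    if (used + size <= memoryBudget) { used += size; (size, "memory") }
    else (size, "disk")
  }
}

// A skewed 100 MB block spills to disk while the small blocks stay in memory.
val mb = 1024L * 1024L
val placement = placeBlocks(Seq(10 * mb, 20 * mb, 100 * mb), 64 * mb)
```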
[jira] [Commented] (SPARK-20301) Flakiness in StreamingAggregationSuite
[ https://issues.apache.org/jira/browse/SPARK-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965297#comment-15965297 ] Apache Spark commented on SPARK-20301:

User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/17613

> Flakiness in StreamingAggregationSuite
>
> Key: SPARK-20301
> URL: https://issues.apache.org/jira/browse/SPARK-20301
> Project: Spark
> Issue Type: Test
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: Burak Yavuz
> Assignee: Burak Yavuz
> Labels: flaky-test
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.StreamingAggregationSuite
[jira] [Assigned] (SPARK-20301) Flakiness in StreamingAggregationSuite
[ https://issues.apache.org/jira/browse/SPARK-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20301: Assignee: Burak Yavuz (was: Apache Spark)
[jira] [Assigned] (SPARK-20301) Flakiness in StreamingAggregationSuite
[ https://issues.apache.org/jira/browse/SPARK-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20301: Assignee: Apache Spark (was: Burak Yavuz)
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965296#comment-15965296 ] Yan Facai (颜发才) commented on SPARK-20199:

Yes, as [~pralabhkumar] said, DecisionTree hardcodes featureSubsetStrategy. How about adding setFeatureSubsetStrategy for DecisionTree?

> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Reporter: pralabhkumar
> Priority: Minor
>
> Spark's GradientBoostedTreesModel doesn't have a column sampling rate parameter. This parameter is available in H2O and XGBoost. Sample from H2O.ai: gbmParams._col_sample_rate. Please provide the parameter.
[jira] [Created] (SPARK-20301) Flakiness in StreamingAggregationSuite
Burak Yavuz created SPARK-20301: --- Summary: Flakiness in StreamingAggregationSuite Key: SPARK-20301 URL: https://issues.apache.org/jira/browse/SPARK-20301 Project: Spark Issue Type: Test Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Burak Yavuz Assignee: Burak Yavuz https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.StreamingAggregationSuite
[jira] [Updated] (SPARK-20301) Flakiness in StreamingAggregationSuite
[ https://issues.apache.org/jira/browse/SPARK-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-20301: Labels: flaky-test (was: )
[jira] [Comment Edited] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965273#comment-15965273 ] Hyukjin Kwon edited comment on SPARK-20297 at 4/12/17 2:24 AM:

Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/6617. cc [~lian cheng]]

was (Author: hyukjin.kwon): Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/6617. cc [~liancheng]

> Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
>
> Key: SPARK-20297
> URL: https://issues.apache.org/jira/browse/SPARK-20297
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Mostafa Mokhtar
> Labels: integration
>
> While trying to load some data using Spark 2.1 I realized that decimal(12,2) columns stored in Parquet written by Spark are not readable by Hive or Impala.
>
> Repro
> {code}
> CREATE TABLE customer_acctbal(
>   c_acctbal decimal(12,2))
> STORED AS Parquet;
> insert into customer_acctbal values (7539.95);
> {code}
>
> Error from Hive
> {code}
> Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
> Time taken: 0.122 seconds
> {code}
>
> Error from Impala
> {code}
> File 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' has an incompatible Parquet schema for column 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: DECIMAL(12,2), Parquet schema: optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
> {code}
>
> Table info
> {code}
> hive> describe formatted customer_acctbal;
> OK
> # col_name              data_type       comment
> c_acctbal               decimal(12,2)
>
> # Detailed Table Information
> Database:               tpch_nested_3000_parquet
> Owner:                  mmokhtar
> CreateTime:             Mon Apr 10 17:47:24 PDT 2017
> LastAccessTime:         UNKNOWN
> Protect Mode:           None
> Retention:              0
> Location:               hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
> Table Type:             MANAGED_TABLE
> Table Parameters:
>   COLUMN_STATS_ACCURATE true
>   numFiles              1
>   numRows               0
>   rawDataSize           0
>   totalSize             120
>   transient_lastDdlTime 1491871644
>
> # Storage Information
> SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> Compressed:             No
> Num Buckets:            -1
> Bucket Columns:         []
> Sort Columns:           []
> Storage Desc Params:
>   serialization.format  1
> Time taken: 0.032 seconds, Fetched: 31 row(s)
> {code}
[jira] [Comment Edited] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965273#comment-15965273 ] Hyukjin Kwon edited comment on SPARK-20297 at 4/12/17 2:23 AM: --- Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/6617. cc [~liancheng] was (Author: hyukjin.kwon): Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/8566. cc [~liancheng]
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965273#comment-15965273 ] Hyukjin Kwon commented on SPARK-20297: -- Let me leave some pointers about related PRs - https://github.com/apache/spark/pull/8566 and https://github.com/apache/spark/pull/8566. cc [~liancheng]
[jira] [Comment Edited] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965265#comment-15965265 ] Hyukjin Kwon edited comment on SPARK-20297 at 4/12/17 2:21 AM: --- Thank you so much for trying out [~mmokhtar]. Do you maybe think this JIRA is resolvable maybe? was (Author: hyukjin.kwon): Thank you so much for trying out [~mmokhtar]. Do you maybe think this JIRA is resolvable maybe? Up to my knowledge, this option means to follow Parquet's specification rather than the current way used by Spark. So, if other implementation follows Parquet's specification, I guess this is the correct option for compatibility.
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965268#comment-15965268 ] Hyukjin Kwon commented on SPARK-20297: -- Oh wait, I am sorry. It does follow the newer standard - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal and I missed the documentation. Wouldn't it then be a bug in Impala or Hive?
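To make the schema mismatch above concrete, here is a small sketch (not Spark code; the helper name is made up) of how a DECIMAL value is stored as an unscaled integer under the Parquet logical-types rules, which is why the newer writer can emit decimal(12,2) as {{optional int64}}:

```python
from decimal import Decimal

def decimal_to_unscaled_int64(value: str, scale: int) -> int:
    # Per the Parquet logical-types spec, DECIMAL is stored as an unscaled
    # integer; any precision <= 18 fits in int64, so a decimal(12,2) column
    # can legally be written as "optional int64".
    unscaled = int(Decimal(value).scaleb(scale))
    assert -(2 ** 63) <= unscaled < 2 ** 63
    return unscaled

# The value from the repro: 7539.95 at scale 2 -> unscaled 753995
print(decimal_to_unscaled_int64("7539.95", 2))  # 753995
```

A reader that only understands the legacy fixed_len_byte_array layout will reject this int64 encoding even though both represent the same logical value.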
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965265#comment-15965265 ] Hyukjin Kwon commented on SPARK-20297: -- Thank you so much for trying out [~mmokhtar]. Do you maybe think this JIRA is resolvable maybe? Up to my knowledge, this option means to follow Parquet's specification rather than the current way used by Spark. So, if other implementation follows Parquet's specification, I guess this is the correct option for compatibility.
[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965264#comment-15965264 ] Hyukjin Kwon commented on SPARK-20294: -- Also, it infers schema from the first row if sampling ratio is not given up to my knowledge. {code} >>> small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) >>> small_rdd.toDF().printSchema() >>> small_rdd = sc.parallelize([(1, 2), ('a', 'b')]) >>> small_rdd.toDF().printSchema() >>> small_rdd = sc.parallelize([(1, 1.1), ('a', 'b')]) >>> small_rdd.toDF().printSchema() {code} > _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > ```python > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > ``` > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. > This is probably a problem with the other Spark APIs however I don't have the > knowledge to look at the source code for other languages.
[jira] [Comment Edited] (SPARK-3383) DecisionTree aggregate size could be smaller
[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963741#comment-15963741 ] Yan Facai (颜发才) edited comment on SPARK-3383 at 4/12/17 2:02 AM: - I think the task contains two subtasks: 1. Separate `split` from `bin`: currently, for each categorical feature, there is 1 bin per split. That is to say, for N categories, the communication cost is 2^(N-1) - 1 bins. However, if we only collect stats per category and construct the splits afterwards (namely, 1 bin per category), the communication cost is N bins. 2. As said in the Description, store all but the last bin, and also store the total statistics for each node. The communication cost will then be N-1 bins. I have a question: why are unordered features only allowed in multiclass classification? was (Author: facai): I think the task contains two subtask: 1. separate `split` with `bin`: Now for each categorical feature, there is 1 bin per split. That's said, for N categories, the communicate cost is 2^{N-1} - 1 bins. However, if we only get stats for each category, and construct splits finally. Namely, 1 bin per category. The communicate cost is N bins. 2. As said in Description, store all but the last bin, and also store the total statistics for each node. The communicate cost will be N-1 bins. I have a question: 1. why unordered features only are allowed in multiclass classification? > DecisionTree aggregate size could be smaller > > > Key: SPARK-3383 > URL: https://issues.apache.org/jira/browse/SPARK-3383 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Storage and communication optimization: > DecisionTree aggregate statistics could store less data (described below). > The savings would be significant for datasets with many low-arity categorical > features (binary features, or unordered categorical features). Savings would > be negligible for continuous features. 
> DecisionTree stores a vector of sufficient statistics for each (node, feature, > bin). We could store 1 fewer bin per (node, feature): For a given (node, > feature), if we store these vectors for all but the last bin, and also store > the total statistics for each node, then we could compute the statistics for > the last bin. For binary and unordered categorical features, this would cut > in half the number of bins to store and communicate.
[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller
[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965263#comment-15965263 ] Yan Facai (颜发才) commented on SPARK-3383: How about this idea? 1. We use `bin` to represent a value, which is the quantized value for a continuous feature and the category for a discrete feature. In DTStatsAggregator, only collect stats of bins. At this stage, all operators are the same, regardless of whether the feature is continuous or discrete. 2. In `binsToBestSplit`: + A continuous / ordered discrete feature has N bins, from which we construct N - 1 splits. + An unordered discrete feature has N bins, from which we construct all possible combinations, namely 2^(N-1) - 1 splits. 3. In `binsToBestSplit`, collect all splits and calculate their impurity; order the splits and find the best one.
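The bin and split counts discussed in this thread can be sanity-checked with a small sketch (the helper names are illustrative, not Spark's API):

```python
def num_unordered_splits(n_categories: int) -> int:
    # An unordered categorical feature with N categories admits
    # 2^(N-1) - 1 distinct non-trivial binary partitions, which is what
    # the aggregator communicates today (1 bin per split).
    return 2 ** (n_categories - 1) - 1

def num_per_category_bins(n_categories: int) -> int:
    # The proposed scheme: collect stats per category (1 bin each) and
    # construct candidate splits from those stats afterwards.
    return n_categories

# e.g. N = 4 categories: 7 split bins to communicate today vs 4 per-category bins
print(num_unordered_splits(4), num_per_category_bins(4))
```

The gap grows exponentially with N, which is why the change matters most for high-arity unordered features.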
[jira] [Comment Edited] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965259#comment-15965259 ] Hyukjin Kwon edited comment on SPARK-20294 at 4/12/17 2:00 AM: --- I am resolving this because this does not look like a problem. I guess we are unable to infer the schema when the data is empty. I think the workaround can be ... {code} >>> small_rdd = sc.parallelize([(1, '2'), (2, 'foo')]) >>> small_rdd.toDF(sampleRatio=0.1).show() +---+---+ | _1| _2| +---+---+ | 1| 2| | 2|foo| +---+---+ {code} if it needs 1%. Please reopen this if anyone thinks differently or I misunderstood. was (Author: hyukjin.kwon): I am resolving this because this does not look a problem. I guess we are unable to infer the schema when the data is empty. I think the workaround can be ... {code} >>> small_rdd = sc.parallelize([(1, '2'), (2, 'foo')]) >>> schema = small_rdd.toDF().schema >>> small_rdd.toDF(schema).show() +---+---+ | _1| _2| +---+---+ | 1| 2| | 2|foo| +---+---+ {code} or maybe {code} >>> small_rdd = sc.parallelize([(1, '2'), (2, 'foo')]) >>> schema = small_rdd.toDF(sampleRatio=0.5).schema >>> small_rdd.toDF(schema).show() +---+---+ | _1| _2| +---+---+ | 1| 2| | 2|foo| +---+---+ {code} Please reopen this if anyone thinks differently or I misunderstood.
[jira] [Resolved] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20294. -- Resolution: Invalid I am resolving this because this does not look a problem. I guess we are unable to infer the schema when the data is empty. I think the workaround can be ... {code} >>> small_rdd = sc.parallelize([(1, '2'), (2, 'foo')]) >>> schema = small_rdd.toDF().schema >>> small_rdd.toDF(schema).show() +---+---+ | _1| _2| +---+---+ | 1| 2| | 2|foo| +---+---+ {code} or maybe {code} >>> small_rdd = sc.parallelize([(1, '2'), (2, 'foo')]) >>> schema = small_rdd.toDF(sampleRatio=0.5).schema >>> small_rdd.toDF(schema).show() +---+---+ | _1| _2| +---+---+ | 1| 2| | 2|foo| +---+---+ {code} Please reopen this if anyone thinks differently or I misunderstood.
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965255#comment-15965255 ] Mostafa Mokhtar commented on SPARK-20297: - [~hyukjin.kwon] Data written by Spark is readable by Hive and Impala when spark.sql.parquet.writeLegacyFormat is enabled.
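For reference, a minimal sketch of that workaround. {{spark.sql.parquet.writeLegacyFormat}} is the actual Spark SQL conf key; the {{spark}} session variable and the table name are assumed from the repro above, so treat this as a config fragment rather than a tested recipe:

```python
# Assumes an existing SparkSession bound to `spark` and the repro table.
# With writeLegacyFormat enabled, Spark writes decimals in the legacy
# fixed_len_byte_array layout that older Hive/Impala readers expect,
# instead of the spec-compliant int32/int64 encoding.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
spark.sql("INSERT INTO customer_acctbal VALUES (7539.95)")
```

The setting only affects files written after it is set; existing Parquet files keep whatever encoding they were written with.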
[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965257#comment-15965257 ] Hyukjin Kwon commented on SPARK-20294: -- Just for other guys, {code} >>> small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) >>> small_rdd.toDF(sampleRatio=0.01).show() Traceback (most recent call last): File "", line 1, in File ".../spark/python/pyspark/sql/session.py", line 57, in toDF return sparkSession.createDataFrame(self, schema, sampleRatio) File ".../spark/python/pyspark/sql/session.py", line 524, in createDataFrame rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio) File ".../spark/python/pyspark/sql/session.py", line 364, in _createFromRDD struct = self._inferSchema(rdd, samplingRatio) File ".../spark/python/pyspark/sql/session.py", line 356, in _inferSchema schema = rdd.map(_infer_schema).reduce(_merge_type) File ".../spark/python/pyspark/rdd.py", line 838, in reduce raise ValueError("Can not reduce() empty RDD") ValueError: Can not reduce() empty RDD {code} > _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > ```python > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > ``` > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. 
However, > this is not the desired result, because we are able to sample at least 1% of > the RDD. > This is probably a problem in the other Spark APIs as well; however, I don't have the > knowledge to look at the source code for the other languages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
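The failure above boils down to a Bernoulli sample that comes back empty, after which {{_inferSchema}} has nothing to reduce. The direction the report implies is to fall back to the full data when that happens. A minimal plain-Python sketch of that fallback (the helper name `sample_with_fallback` is illustrative, not PySpark code):

```python
import random

def sample_with_fallback(rows, sampling_ratio, seed=42):
    """Bernoulli-sample `rows`; if the sample is empty (likely for tiny
    inputs and small ratios), fall back to the full input so downstream
    schema inference always sees at least one row."""
    rng = random.Random(seed)
    sampled = [r for r in rows if rng.random() < sampling_ratio]
    # As of this ticket, PySpark instead raises
    # "Can not reduce() empty RDD" at this point.
    return sampled if sampled else rows

rows = [(1, 2), (2, "foo")]  # the small_rdd from the report
assert sample_with_fallback(rows, 0.01) == rows  # 1% sample of 2 rows is empty, so fall back
```

With a two-element input and a 1% ratio, the sample is almost always empty, so the fallback path is what keeps inference alive.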
[jira] [Updated] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-20297: - Priority: Major (was: Critical) > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. 
Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-20297: - Component/s: (was: Spark Core) SQL > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Critical > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. 
Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
[ https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965245#comment-15965245 ] Hyukjin Kwon commented on SPARK-20297: -- For me, this sounds related to {{spark.sql.parquet.writeLegacyFormat}}. I haven't tested and double-checked it myself, but I assume Hive expects the decimal as fixed-length bytes, while Spark actually writes it out as INT32 for 1 <= precision <= 9 and INT64 for 10 <= precision <= 18. Would you mind trying it out with {{spark.sql.parquet.writeLegacyFormat}} enabled? > Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala > --- > > Key: SPARK-20297 > URL: https://issues.apache.org/jira/browse/SPARK-20297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Mostafa Mokhtar >Priority: Critical > Labels: integration > > While trying to load some data using Spark 2.1 I realized that decimal(12,2) > columns stored in Parquet written by Spark are not readable by Hive or Impala. > Repro > {code} > CREATE TABLE customer_acctbal( > c_acctbal decimal(12,2)) > STORED AS Parquet; > insert into customer_acctbal values (7539.95); > {code} > Error from Hive > {code} > Failed with exception > java.io.IOException:parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file > hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet > Time taken: 0.122 seconds > {code} > Error from Impala > {code} > File > 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' > has an incompatible Parquet schema for column > 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. 
Column type: > DECIMAL(12,2), Parquet schema: > optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) > {code} > Table info > {code} > hive> describe formatted customer_acctbal; > OK > # col_name data_type comment > c_acctbal decimal(12,2) > # Detailed Table Information > Database: tpch_nested_3000_parquet > Owner: mmokhtar > CreateTime: Mon Apr 10 17:47:24 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention: 0 > Location: > hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal > Table Type: MANAGED_TABLE > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 0 > rawDataSize 0 > totalSize 120 > transient_lastDdlTime 1491871644 > # Storage Information > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:-1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.032 seconds, Fetched: 31 row(s) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
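The hypothesis in the comment above can be sketched as a tiny model of the precision-to-physical-type mapping. The cutoffs (INT32 up to precision 9, INT64 up to 18) come from the comment itself; that legacy mode uses fixed-length byte arrays for every precision is an assumption here, so treat this as a sketch of the suspected behavior, not Spark's actual writer code:

```python
def parquet_physical_type(precision, write_legacy_format=False):
    """Model of how a DECIMAL(precision, scale) column may be laid out
    in Parquet: standard mode packs small decimals into plain ints,
    while legacy mode uses the fixed-length bytes older readers expect."""
    if write_legacy_format:
        return "FIXED_LEN_BYTE_ARRAY"   # assumed legacy layout
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"                  # DECIMAL(12,2) lands here
    return "FIXED_LEN_BYTE_ARRAY"

# DECIMAL(12,2) from the report: int64 without the flag, matching the
# "optional int64 c_acctbal" Impala complains about.
assert parquet_physical_type(12) == "INT64"
assert parquet_physical_type(12, write_legacy_format=True) == "FIXED_LEN_BYTE_ARRAY"
```

If the hypothesis holds, setting {{spark.sql.parquet.writeLegacyFormat}} to {{true}} before the insert should produce files that the Hive and Impala versions in the report can read.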
[jira] [Comment Edited] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965244#comment-15965244 ] Yan Facai (颜发才) edited comment on SPARK-10788 at 4/12/17 1:35 AM: -- [~josephkb] As categories A, B and C are independent, why not collect statistics only per category? I mean 1 bin per category, instead of 1 bin per split. Splits are calculated in the last step, in `binsToBestSplit`, so the communication cost is N bins. was (Author: facai): [~josephkb] As categories A, B and C are independent, why not collect statistics only per category? Splits are calculated in the last step in `binsToBestSplit`. So communication cost is N bins. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.0.0 > > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965244#comment-15965244 ] Yan Facai (颜发才) commented on SPARK-10788: - [~josephkb] As categories A, B and C are independent, why not collect statistics only per category? Splits are calculated in the last step, in `binsToBestSplit`, so the communication cost is N bins. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.0.0 > > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
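The pseudomath in the issue body, {{stats(B,C) = stats(A,B,C) - stats(A)}}, can be sketched with toy per-label counts. The dict structures here are illustrative, not the aggregation buffers RandomForest.scala actually uses:

```python
def right_side_stats(node_stats, left_stats):
    """Recover the right-hand side's per-label counts from the full
    node's counts minus the left-hand side's, so only left-side bins
    (plus one node total) ever need to be communicated."""
    return {label: node_stats[label] - left_stats.get(label, 0)
            for label in node_stats}

# Toy per-category label counts for categories A, B, C.
stats = {"A": {0: 4, 1: 1}, "B": {0: 2, 1: 3}, "C": {0: 1, 1: 5}}
node = {0: 7, 1: 9}  # stats(A, B, C): the whole node

# Split "A vs. B,C": the right side is derived instead of collected.
assert right_side_stats(node, stats["A"]) == {0: 3, 1: 8}
```

With this subtraction, only the 3 left-hand subsets need statistics, halving the 6 subsets the current implementation communicates.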
[jira] [Updated] (SPARK-20295) when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse
[ https://issues.apache.org/jira/browse/SPARK-20295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruhui Wang updated SPARK-20295: --- Description: When running tpcds-q95 with spark.sql.adaptive.enabled = true, the physical plan is initially: Sort : +- Exchange(coordinator id: 1) : +- Project*** ::-Sort ** :: +- Exchange(coordinator id: 2) :: :- Project *** :+- Sort :: +- Exchange(coordinator id: 3) When spark.sql.exchange.reuse is enabled, the physical plan then becomes: Sort : +- Exchange(coordinator id: 1) : +- Project*** ::-Sort ** :: +- Exchange(coordinator id: 2) :: :- Project *** :+- Sort :: +- ReusedExchange Exchange(coordinator id: 2) With spark.sql.adaptive.enabled = true, the call stack is ShuffleExchange#doExecute --> postShuffleRDD --> doEstimationIfNecessary. In that function, assert(exchanges.length == numExchanges) fails, as the left side has only one element while the right side equals 2. Is this a bug in the interaction between spark.sql.adaptive.enabled and exchange reuse? was: When spark.sql.exchange.reuse is enabled and a query with a self join (such as tpcds-q95) is run, the physical plan will occasionally become: WholeStageCodegen : +- Project [id#0L] : +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None ::- Project [id#0L] :: +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None :: :- Range 0, 1, 4, 1024, [id#0L] :: +- INPUT :+- INPUT :- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L)) : +- WholeStageCodegen : : +- Range 0, 1, 4, 1024, [id#1L] +- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L)) With spark.sql.adaptive.enabled = true, the call stack is ShuffleExchange#doExecute --> postShuffleRDD --> doEstimationIfNecessary. In that function, assert(exchanges.length == numExchanges) fails, as the left side has only one element while the right side equals 2. Is this a bug in the interaction between spark.sql.adaptive.enabled and exchange reuse? 
> when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse > -- > > Key: SPARK-20295 > URL: https://issues.apache.org/jira/browse/SPARK-20295 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 2.1.0 >Reporter: Ruhui Wang > > When running tpcds-q95 with spark.sql.adaptive.enabled = true, the physical > plan is initially: > Sort > : +- Exchange(coordinator id: 1) > : +- Project*** > ::-Sort ** > :: +- Exchange(coordinator id: 2) > :: :- Project *** > :+- Sort > :: +- Exchange(coordinator id: 3) > When spark.sql.exchange.reuse is enabled, the physical plan then becomes: > Sort > : +- Exchange(coordinator id: 1) > : +- Project*** > ::-Sort ** > :: +- Exchange(coordinator id: 2) > :: :- Project *** > :+- Sort > :: +- ReusedExchange Exchange(coordinator id: 2) > With spark.sql.adaptive.enabled = true, the call stack is > ShuffleExchange#doExecute --> postShuffleRDD --> > doEstimationIfNecessary. In that function, > assert(exchanges.length == numExchanges) fails, as the left side has only > one element while the right side equals 2. > Is this a bug in the interaction between spark.sql.adaptive.enabled and exchange reuse? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965220#comment-15965220 ] Reynold Xin commented on SPARK-20202: - There are no currently targeted versions, are there? > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965218#comment-15965218 ] holdenk commented on SPARK-20202: - Would it possibly make sense to untarget this from the maintenance releases (1.6.X, 2.0.X, 2.1.X) and instead focus on the future versions? > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7128) Add generic bagging algorithm to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965121#comment-15965121 ] yuhao yang commented on SPARK-7128: --- I would vote for adding this now. This is quite helpful in practical applications like fraud detection, and feynmanliang has started with a solid prototype. I can help finish it if this is on the roadmap. > Add generic bagging algorithm to spark.ml > - > > Key: SPARK-7128 > URL: https://issues.apache.org/jira/browse/SPARK-7128 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > The Pipelines API will make it easier to create a generic Bagging algorithm > which can work with any Classifier or Regressor. Creating this feature will > require researching the possible variants and extensions of bagging which we > may want to support now and/or in the future, and planning an API which will > be properly extensible. > Note: This may interact some with the existing tree ensemble methods, but it > should be largely separate since the tree ensemble APIs and implementations > are specialized for trees. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items
Joseph K. Bradley created SPARK-20300: - Summary: Python API for ALSModel.recommendForAllUsers,Items Key: SPARK-20300 URL: https://issues.apache.org/jira/browse/SPARK-20300 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Joseph K. Bradley Python API for ALSModel methods recommendForAllUsers, recommendForAllItems -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
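On the Scala side, recommendForAllUsers and recommendForAllItems rank candidates for every user (or item) by the dot product of the learned user and item factors. A hedged pure-Python sketch of that ranking, to show what the requested Python API would expose (names and structures are illustrative, not the eventual PySpark signature):

```python
def recommend_for_all_users(user_factors, item_factors, k):
    """For each user, score every item by the factor dot product and
    keep the top-k (item_id, score) pairs."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    recs = {}
    for user, uf in user_factors.items():
        scored = [(item, dot(uf, vf)) for item, vf in item_factors.items()]
        scored.sort(key=lambda t: t[1], reverse=True)  # highest score first
        recs[user] = scored[:k]
    return recs

# Two toy users and two toy items with 2-dimensional factors.
users = {0: [1.0, 0.0], 1: [0.0, 1.0]}
items = {10: [0.9, 0.1], 20: [0.2, 0.8]}
assert recommend_for_all_users(users, items, 1) == {0: [(10, 0.9)], 1: [(20, 0.8)]}
```

The real implementation does this with a blocked cross join over factor matrices rather than a per-user loop, but the ranking semantics are the same.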
[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
[ https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964969#comment-15964969 ] Xin Wu commented on SPARK-20256: Yes. I am working on it. My proposal is to revert the SPARK-18050 change, then add a try-catch over externalCatalog.createDatabase(...) and log the error of existing default database from Hive into DEBUG log. I am trying to create a unit-test case to simulate the permission issue, which I have some difficulty. > Fail to start SparkContext/SparkSession with Hive support enabled when user > does not have read/write privilege to Hive metastore warehouse dir > -- > > Key: SPARK-20256 > URL: https://issues.apache.org/jira/browse/SPARK-20256 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Xin Wu >Priority: Critical > > In a cluster setup with production Hive running, when the user wants to run > spark-shell using the production Hive metastore, hive-site.xml is copied to > SPARK_HOME/conf. So when spark-shell is being started, it tries to check > database existence of "default" database from Hive metastore. Yet, since this > user may not have READ/WRITE access to the configured Hive warehouse > directory done by Hive itself, such permission error will prevent spark-shell > or any spark application with Hive support enabled from starting at all. > Example error: > {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). 
> java.lang.IllegalArgumentException: Error while instantiating > 'org.apache.spark.sql.hive.HiveSessionState': > at > org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) > at > org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110) > at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > ... 
47 elided > Caused by: java.lang.reflect.InvocationTargetException: > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.security.AccessControlException: Permission > denied: user=notebook, access=READ, > inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at >
[jira] [Created] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset
Jacek Laskowski created SPARK-20299: --- Summary: NullPointerException when null and string are in a tuple while encoding Dataset Key: SPARK-20299 URL: https://issues.apache.org/jira/browse/SPARK-20299 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor When creating a Dataset from a tuple with {{null}} and a string, NPE is reported. When either is removed, it works fine. {code} scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int] scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474 assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product input object), - root class: "scala.Tuple2")._2 AS _2#475 at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454) at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246) ... 
48 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 58 more {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir
[ https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964945#comment-15964945 ] Dongjoon Hyun commented on SPARK-20256: --- Hi, [~xwu0226]. Is there any progress? > Fail to start SparkContext/SparkSession with Hive support enabled when user > does not have read/write privilege to Hive metastore warehouse dir > -- > > Key: SPARK-20256 > URL: https://issues.apache.org/jira/browse/SPARK-20256 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Xin Wu >Priority: Critical > > In a cluster setup with production Hive running, when the user wants to run > spark-shell using the production Hive metastore, hive-site.xml is copied to > SPARK_HOME/conf. So when spark-shell is being started, it tries to check > database existence of "default" database from Hive metastore. Yet, since this > user may not have READ/WRITE access to the configured Hive warehouse > directory done by Hive itself, such permission error will prevent spark-shell > or any spark application with Hive support enabled from starting at all. > Example error: > {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). 
> java.lang.IllegalArgumentException: Error while instantiating > 'org.apache.spark.sql.hive.HiveSessionState': > at > org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981) > at > org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110) > at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > ... 
47 elided > Caused by: java.lang.reflect.InvocationTargetException: > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: > MetaException(message:java.security.AccessControlException: Permission > denied: user=notebook, access=READ, > inode="/apps/hive/warehouse":hive:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697) > at 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) > ); > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at >
[jira] [Assigned] (SPARK-20298) Spelling mistake: charactor
[ https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20298: Assignee: Apache Spark > Spelling mistake: charactor > --- > > Key: SPARK-20298 > URL: https://issues.apache.org/jira/browse/SPARK-20298 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Brendan Dwyer >Assignee: Apache Spark >Priority: Trivial > > "charactor" should be "character" > {code} > R/pkg/R/DataFrame.R:2821: stop("path should be charactor, NULL > or omitted.") > R/pkg/R/DataFrame.R:2828: stop("mode should be charactor or > omitted. It is 'error' by default.") > R/pkg/R/DataFrame.R:3043: stop("value should be an integer, > numeric, charactor or named list.") > R/pkg/R/DataFrame.R:3055: stop("Each item in value should be > an integer, numeric or charactor.") > R/pkg/R/DataFrame.R:3601: stop("outputMode should be charactor > or omitted.") > R/pkg/R/SQLContext.R:609:stop("path should be charactor, NULL or > omitted.") > R/pkg/inst/tests/testthat/test_sparkSQL.R:2929: "path should be > charactor, NULL or omitted.") > R/pkg/inst/tests/testthat/test_sparkSQL.R:2931: "mode should be > charactor or omitted. It is 'error' by default.") > R/pkg/inst/tests/testthat/test_sparkSQL.R:2950: "path should be > charactor, NULL or omitted.") > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20298) Spelling mistake: charactor
[ https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20298: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20298) Spelling mistake: charactor
[ https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964921#comment-15964921 ] Apache Spark commented on SPARK-20298: -- User 'bdwyer2' has created a pull request for this issue: https://github.com/apache/spark/pull/17611
[jira] [Commented] (SPARK-20298) Spelling mistake: charactor
[ https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964889#comment-15964889 ] Sean Owen commented on SPARK-20298: --- Go ahead, though this is not generally worth a JIRA
[jira] [Assigned] (SPARK-19505) AttributeError on Exception.message in Python3; hides true exceptions in cloudpickle.py and broadcast.py
[ https://issues.apache.org/jira/browse/SPARK-19505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-19505: --- Assignee: David Gingrich > AttributeError on Exception.message in Python3; hides true exceptions in > cloudpickle.py and broadcast.py > > > Key: SPARK-19505 > URL: https://issues.apache.org/jira/browse/SPARK-19505 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 > Environment: macOS Sierra 10.12.3 > Spark 2.1.0, installed via Homebrew >Reporter: David Gingrich >Assignee: David Gingrich >Priority: Minor > Fix For: 2.2.0 > > > cloudpickle.py and broadcast.py both catch Exceptions then look at the > 'message' field. The message field doesn't exist in Python 3 so it rethrows > and the original exception is lost. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19505) AttributeError on Exception.message in Python3; hides true exceptions in cloudpickle.py and broadcast.py
[ https://issues.apache.org/jira/browse/SPARK-19505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-19505. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16845 [https://github.com/apache/spark/pull/16845]
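For context, the incompatibility described in SPARK-19505 can be reproduced in a few lines of plain Python (an illustrative sketch, not code from the Spark tree). In Python 3, exception objects no longer carry a `message` attribute, so a handler that reads `e.message` raises `AttributeError` itself and the original exception is lost; `str(e)` works on both major versions:

```python
# Illustrative sketch (not Spark code): why reading e.message hides the
# original exception under Python 3.

def fragile_handler():
    # Mimics the old pattern described for cloudpickle.py / broadcast.py.
    try:
        raise ValueError("the real problem")
    except Exception as e:
        # Python 2 exceptions had a .message attribute; Python 3 ones do not,
        # so this access raises AttributeError and the ValueError is masked.
        return e.message

def robust_handler():
    try:
        raise ValueError("the real problem")
    except Exception as e:
        # str(e) works on both Python 2 and Python 3.
        return str(e)

if __name__ == "__main__":
    try:
        fragile_handler()
    except AttributeError as masked:
        print("original error masked by:", masked)
    print(robust_handler())
```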
[jira] [Commented] (SPARK-20131) Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-20131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964811#comment-15964811 ] Apache Spark commented on SPARK-20131: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17610 > Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data > should be deserialized properly > --- > > Key: SPARK-20131 > URL: https://issues.apache.org/jira/browse/SPARK-20131 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Takuya Ueshin >Priority: Minor > Labels: flaky-test > > This test failed recently here: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2861/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/SPARK_18560_Receiver_data_should_be_deserialized_properly_/ > Dashboard > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.StreamingContextSuite_name=SPARK-18560+Receiver+data+should+be+deserialized+properly. 
> Error Message > {code} > latch.await(60L, SECONDS) was false > {code} > {code} > org.scalatest.exceptions.TestFailedException: latch.await(60L, SECONDS) was > false > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply$mcV$sp(StreamingContextSuite.scala:837) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:44) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) > at > org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:44) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) > at >
[jira] [Updated] (SPARK-20131) Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-20131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20131: - Summary: Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly (was: Flaky Test: org.apache.spark.streaming.StreamingContextSuite)
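The failing assertion above is a countdown-latch wait with a 60-second timeout. The same pattern in plain Python (illustrative only; `run_with_latch` is a hypothetical helper, not the Scala test itself) shows why slow receiver deserialization makes such a test flaky: the assertion fails whenever the signal does not arrive within the window.

```python
import threading

def run_with_latch(work, timeout):
    # Analogous to a CountDownLatch(1): the worker signals completion,
    # and the test thread waits up to `timeout` seconds for the signal.
    latch = threading.Event()
    t = threading.Thread(target=lambda: (work(), latch.set()))
    t.start()
    ok = latch.wait(timeout)  # like the asserted latch.await(60L, SECONDS)
    t.join()
    return ok
```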
[jira] [Commented] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
[ https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964756#comment-15964756 ] Apache Spark commented on SPARK-20296: -- User 'jtoka' has created a pull request for this issue: https://github.com/apache/spark/pull/17609 > UnsupportedOperationChecker text on distinct aggregations differs from docs > --- > > Key: SPARK-20296 > URL: https://issues.apache.org/jira/browse/SPARK-20296 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Jason Tokayer >Priority: Trivial > > In the unsupported operations section in the docs > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html > it states that "Distinct operations on streaming Datasets are not > supported.". However, in > ```org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker.scala```, > the error message is ```Distinct aggregations are not supported on streaming > DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete > output mode. Consider using approximate distinct aggregation```. > It seems that the error message is incorrect. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
[ https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20296: Assignee: Apache Spark
[jira] [Assigned] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
[ https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20296: Assignee: (was: Apache Spark)
[jira] [Resolved] (SPARK-20289) Use StaticInvoke rather than NewInstance for boxing primitive types
[ https://issues.apache.org/jira/browse/SPARK-20289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20289. - Resolution: Fixed Fix Version/s: 2.2.0 > Use StaticInvoke rather than NewInstance for boxing primitive types > --- > > Key: SPARK-20289 > URL: https://issues.apache.org/jira/browse/SPARK-20289 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.0 > > > Dataset typed API currently uses NewInstance to box primitive types (i.e. > calling the constructor). Instead, it'd be slightly more idiomatic in Java to > use PrimitiveType.valueOf, which can be invoked using StaticInvoke expression.
[jira] [Commented] (SPARK-20298) Spelling mistake: charactor
[ https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964741#comment-15964741 ] Brendan Dwyer commented on SPARK-20298: --- I can work on this
[jira] [Created] (SPARK-20298) Spelling mistake: charactor
Brendan Dwyer created SPARK-20298: - Summary: Spelling mistake: charactor Key: SPARK-20298 URL: https://issues.apache.org/jira/browse/SPARK-20298 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Brendan Dwyer Priority: Trivial "charactor" should be "character" {code} R/pkg/R/DataFrame.R:2821: stop("path should be charactor, NULL or omitted.") R/pkg/R/DataFrame.R:2828: stop("mode should be charactor or omitted. It is 'error' by default.") R/pkg/R/DataFrame.R:3043: stop("value should be an integer, numeric, charactor or named list.") R/pkg/R/DataFrame.R:3055: stop("Each item in value should be an integer, numeric or charactor.") R/pkg/R/DataFrame.R:3601: stop("outputMode should be charactor or omitted.") R/pkg/R/SQLContext.R:609:stop("path should be charactor, NULL or omitted.") R/pkg/inst/tests/testthat/test_sparkSQL.R:2929: "path should be charactor, NULL or omitted.") R/pkg/inst/tests/testthat/test_sparkSQL.R:2931: "mode should be charactor or omitted. It is 'error' by default.") R/pkg/inst/tests/testthat/test_sparkSQL.R:2950: "path should be charactor, NULL or omitted.") {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
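The `path:line: text` listing in the description reads like the output of a recursive grep over the R sources. A minimal sketch of such a scan (a hypothetical helper, not part of the Spark build tooling) might look like:

```python
# Hypothetical helper: recursively list occurrences of a misspelling in
# .R files, in the same "path:line: text" shape as the report above.
import os

def find_typo(root, needle="charactor"):
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".R"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                for lineno, line in enumerate(f, start=1):
                    if needle in line:
                        hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits
```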
[jira] [Created] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
Mostafa Mokhtar created SPARK-20297: --- Summary: Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala Key: SPARK-20297 URL: https://issues.apache.org/jira/browse/SPARK-20297 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Mostafa Mokhtar Priority: Critical While trying to load some data using Spark 2.1 I realized that decimal(12,2) columns stored in Parquet written by Spark are not readable by Hive or Impala. Repro {code} CREATE TABLE customer_acctbal( c_acctbal decimal(12,2)) STORED AS Parquet; insert into customer_acctbal values (7539.95); {code} Error from Hive {code} Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet Time taken: 0.122 seconds {code} Error from Impala {code} File 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' has an incompatible Parquet schema for column 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. 
Column type: DECIMAL(12,2), Parquet schema: optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar) {code} Table info {code} hive> describe formatted customer_acctbal; OK # col_name data_type comment c_acctbal decimal(12,2) # Detailed Table Information Database: tpch_nested_3000_parquet Owner: mmokhtar CreateTime: Mon Apr 10 17:47:24 PDT 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE true numFiles1 numRows 0 rawDataSize 0 totalSize 120 transient_lastDdlTime 1491871644 # Storage Information SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat Compressed: No Num Buckets:-1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: serialization.format1 Time taken: 0.032 seconds, Fetched: 31 row(s) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
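For background (an explanatory sketch, not output from the reporter's cluster): the Parquet format allows a `decimal(12,2)` to be stored either as a fixed-length byte array or as an unscaled 64-bit integer plus a scale recorded in the schema; the error above shows Spark wrote the int64 form while the Hive/Impala readers in use expected the byte-array form. The round trip between the logical value and the unscaled int64 works like this:

```python
# Illustrative only: how a decimal(12,2) value maps to the unscaled int64
# representation (per the Parquet DECIMAL logical-type rules) that the
# readers in this report did not accept.
from decimal import Decimal

SCALE = 2  # decimal(12,2)

def to_unscaled(value: Decimal, scale: int = SCALE) -> int:
    # 7539.95 with scale 2 becomes the integer 753995.
    return int(value.scaleb(scale))

def from_unscaled(unscaled: int, scale: int = SCALE) -> Decimal:
    return Decimal(unscaled).scaleb(-scale)

if __name__ == "__main__":
    v = Decimal("7539.95")
    u = to_unscaled(v)
    print(u, from_unscaled(u))
```

A commonly cited mitigation (worth verifying against your Spark version's docs) is setting `spark.sql.parquet.writeLegacyFormat=true` so Spark emits the older byte-array encoding that Hive and Impala of this era could read.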
[jira] [Updated] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
[ https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20296: -- Priority: Trivial (was: Minor) Sounds fine, you can submit a PR to make the messages consistent.
[jira] [Created] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
Jason Tokayer created SPARK-20296: - Summary: UnsupportedOperationChecker text on distinct aggregations differs from docs Key: SPARK-20296 URL: https://issues.apache.org/jira/browse/SPARK-20296 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Jason Tokayer Priority: Minor In the unsupported operations section in the docs https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html it states that "Distinct operations on streaming Datasets are not supported.". However, in ```org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker.scala```, the error message is ```Distinct aggregations are not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode. Consider using approximate distinct aggregation```. It seems that the error message is incorrect. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964596#comment-15964596 ] jin xing commented on SPARK-19659: -- *bytesShuffleToMemory* is different from *bytesInFlight*. *bytesInFlight* covers only one *ShuffleBlockFetcherIterator* and is decreased when the remote blocks are received. *bytesShuffleToMemory* covers all ShuffleBlockFetcherIterators and is decreased only when the reference count of a ByteBuf reaches zero (though the memory may still sit inside a cache and not be truly destroyed). If *spark.reducer.maxReqsInFlight* is only for memory control, I think *spark.reducer.maxBytesShuffleToMemory* is an improvement. In the current PR, I want to simplify the logic: the memory is tracked by *bytesShuffleToMemory*, and memory usage is not tracked by the MemoryManager. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory (offheap by default) during > shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can > be large in skew situations. If OOM happens during shuffle read, the job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more > memory can resolve the OOM. However, this approach is not perfectly suitable > for production environments, especially data warehouses. > Using Spark SQL as the data engine in a warehouse, users hope for a unified > parameter (e.g. memory) with less resource wasted (allocated but not > used). > It's not always easy to predict skew situations; when they happen, it makes sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM.
This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
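The accounting jin xing describes can be sketched as a single global counter shared by all ShuffleBlockFetcherIterators. This is a hypothetical stand-alone illustration of the proposal, not Spark's actual implementation; the class and method names are invented here:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the proposed accounting: bytesShuffleToMemory is
// increased by the estimated size when a fetch request is sent (before the
// blocks actually arrive) and decreased when the backing buffer is released.
// When the counter would exceed spark.reducer.maxBytesShuffleToMemory, the
// blocks are fetched to disk instead.
class ShuffleToMemoryTracker {
    private final AtomicLong bytesShuffleToMemory = new AtomicLong(0);
    private final long maxBytesShuffleToMemory; // spark.reducer.maxBytesShuffleToMemory

    ShuffleToMemoryTracker(long maxBytes) {
        this.maxBytesShuffleToMemory = maxBytes;
    }

    /** Called before sending a fetch request. Returns true if the blocks may
     *  go to memory, false if they should be shuffled to disk instead. */
    boolean tryReserve(long estimatedBytes) {
        long current = bytesShuffleToMemory.addAndGet(estimatedBytes);
        if (current > maxBytesShuffleToMemory) {
            bytesShuffleToMemory.addAndGet(-estimatedBytes); // undo; fetch to disk
            return false;
        }
        return true;
    }

    /** Called when the ByteBuf backing the fetched blocks is released. */
    void release(long estimatedBytes) {
        bytesShuffleToMemory.addAndGet(-estimatedBytes);
    }

    long current() {
        return bytesShuffleToMemory.get();
    }
}
```

Because the reservation happens at request time using the estimated size, the counter bounds the worst-case memory a reducer can commit to in-memory fetches across concurrent shuffle reads.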
[jira] [Comment Edited] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964554#comment-15964554 ] Dan Dutrow edited comment on SPARK-6305 at 4/11/17 4:18 PM: Any update on the current status of this ticket? It would be particularly beneficial to our program, and presumably many others, if the KafkaAppender (available in log4j 2.x) could be provided to spark workers for real-time log aggregation. https://logging.apache.org/log4j/2.x/manual/appenders.html#KafkaAppender This is not an easy thing to override in a spark application due to native log4j loading with the CoarseGrainedExecutorBackend before the application. was (Author: dutrow): Please provide an update on the current status of this ticket? It would be particularly beneficial to our program, and presumably many others, if the KafkaAppender (available in log4j 2.x) could be provided to spark workers for real-time log aggregation. https://logging.apache.org/log4j/2.x/manual/appenders.html#KafkaAppender This is not an easy thing to override in a spark application due to native log4j loading with the CoarseGrainedExecutorBackend before the application. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964587#comment-15964587 ] jin xing edited comment on SPARK-19659 at 4/11/17 4:13 PM: --- [~cloud_fan] Thanks a lot for taking look into this and sorry for late reply. My proposal is as below: 1. Track average size and also the outliers(which are larger than 2*avgSize) in MapStatus; 2. Add *bytesShuffleToMemory*, which tracks the size of remote blocks shuffled to memory; 3. Add *spark.reducer.maxBytesShuffleToMemory*, when bytesShuffleToMemory is above this configuration, blocks will shuffle to disk; *bytesShuffleToMemory* is increased when send fetch request(note that at this point, remote blocks maybe not fetched into memory yet, but we add the max memory to be used in the fetch request) and get decreased when the ByteBuf is released. *spark.reducer.maxBytesShuffleToMemory* is the max memory to be used for fetching remote blocks across all the ShuffleBlockFetcherIterators(there maybe multiple shuffle-read happening at the same time). When memory usage(indicated by *bytesShuffleToMemory*) is above *spark.reducer.maxBytesShuffleToMemory*, shuffle remote blocks to disk instead of memory. was (Author: jinxing6...@126.com): [~cloud_fan] Thanks a lot for taking look into this and sorry for late reply. My proposal is as below: 1. Track average size and also the outliers(which are larger than 2*avgSize) in MapStatus; 2. Add *bytesShuffleToMemory*, which tracks the size of remote blocks shuffled to memory; 3. Add *spark.reducer.maxBytesShuffleToMemory*, when bytesShuffleToMemory is above this configuration, blocks will shuffle to disk; *bytesShuffleToMemory* is increased when send fetch request(note that at this point, remote blocks maybe not fetched into memory yet, but we add the max memory to be used in the fetch request) and get decreased when the ByteBuf is released. 
*spark.reducer.maxBytesShuffleToMemory* is the max memory to be used for fetching remote blocks across all the *ShuffleBlockFetcherIterator*s(there maybe multiple shuffle-read happening at the same time). When memory usage(indicated by *bytesShuffleToMemory*) is above *spark.reducer.maxBytesShuffleToMemory*, shuffle remote blocks to disk instead of memory. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964587#comment-15964587 ] jin xing commented on SPARK-19659: -- [~cloud_fan] Thanks a lot for taking look into this and sorry for late reply. My proposal is as below: 1. Track average size and also the outliers(which are larger than 2*avgSize) in MapStatus; 2. Add bytesShuffleToMemory, which tracks the size of remote blocks shuffled to memory; 3. Add spark.reducer.maxBytesShuffleToMemory, when bytesShuffleToMemory is above this configuration, blocks will shuffle to disk; bytesShuffleToMemory is increased when send fetch request(note that at this point, remote blocks maybe not fetched into memory yet, but we add the max memory to be used in the fetch request) and get decreased when the ByteBuf is released. spark.reducer.maxBytesShuffleToMemory is the max memory to be used for fetching remote blocks across all the ShuffleBlockFetcherIterators(there maybe multiple shuffle-read happening at the same time). When memory usage(indicated by bytesShuffleToMemory) is above spark.reducer.maxBytesShuffleToMemory, shuffle remote blocks to disk instead of memory. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. 
However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80]
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964548#comment-15964548 ] Wenchen Fan commented on SPARK-19659: - [~jinxing6...@126.com] can you say more about fetching shuffle blocks into disk? How should users tune the {{spark.reducer.maxBytesShuffleToMemory}} config? Shall we have something similar to {{maxReqsInFlight}} to control how many request we can have to fetch blocks to disk at the same time? > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
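The control Wenchen Fan suggests, something analogous to {{maxReqsInFlight}} for fetch-to-disk requests, could be as simple as a semaphore capping concurrency. This is an illustrative sketch under that assumption; the name {{maxReqsToDisk}} is invented here:

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: cap the number of concurrent fetch-to-disk requests
// with a semaphore, analogous to how maxReqsInFlight caps in-flight fetch
// requests. A fetch that cannot acquire a permit waits (or is deferred).
class DiskFetchLimiter {
    private final Semaphore permits;

    DiskFetchLimiter(int maxReqsToDisk) {
        this.permits = new Semaphore(maxReqsToDisk);
    }

    /** Try to start a fetch-to-disk request; false means the cap is reached. */
    boolean tryStartDiskFetch() {
        return permits.tryAcquire();
    }

    /** Called when a fetch-to-disk request completes (success or failure). */
    void finishDiskFetch() {
        permits.release();
    }
}
```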
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964508#comment-15964508 ] jin xing commented on SPARK-19659: -- [~irashid] Tracking memory used by Netty by swapping in our own PooledByteBufAllocator is really a good idea. Memory usage will be increased when allocate byte buffer in PoolByteBufAllocator and get decreased when ByteBuf's reference count is zero. Checking source code of Netty, I found that there is cache inside PoolByteBufAllocator. When memory is released, it will be returned to cache or chunklist, not destroyed necessarily. In my understanding, we can get the memory usage by tracking PooledByteBufAllocator, but the value is not the real footprint.(i.e. when memory is released, it maybe in cache other than destroyed.) > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. 
This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80]
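The reference-counting behavior jin xing describes, where tracked usage only drops when a buffer's reference count reaches zero, can be illustrated with a minimal stand-in (this is not Netty's {{ByteBuf}} implementation, just a sketch of the semantics):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of reference-counted tracking: the shared counter is
// only decremented when a buffer's reference count drops to zero, so a
// buffer retained by multiple readers is not counted as freed early. As
// noted above, even then the pooled allocator may keep the memory cached
// rather than returning it to the OS.
class RefCountedBuffer {
    static final AtomicLong trackedBytes = new AtomicLong(0);

    private final AtomicInteger refCnt = new AtomicInteger(1);
    private final long size;

    RefCountedBuffer(long size) {
        this.size = size;
        trackedBytes.addAndGet(size);
    }

    void retain() {
        refCnt.incrementAndGet();
    }

    void release() {
        if (refCnt.decrementAndGet() == 0) {
            trackedBytes.addAndGet(-size); // only now does tracked usage drop
        }
    }
}
```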
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964476#comment-15964476 ] Imran Rashid commented on SPARK-19659: -- currently the smallest unit is an entire block. That certainly could be changed, but that is another larger change to the way spark uses netty. bq. If the unit is block, I think it's really hard to avoid OOM entirely, as if the estimated block size is wrong, fetching this block may cause OOM and we can do nothing about it. I don't totally agree with this. Yes, the estimated block size could be wrong. The main problem now, though, is that the error is totally unbounded. I really don't think it should be hard to change spark to provide reasonable bounds on that error. Even if that first step doesn't entirely eliminate the chance of OOM, it would certainly help. And that can continue to be improved upon. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory(offheap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large when skew situations. If OOM happens during shuffle read, job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more > memory can resolve the OOM. However the approach is not perfectly suitable > for production environment, especially for data warehouse. > Using Spark SQL as data engine in warehouse, users hope to have a unified > parameter(e.g. 
memory) but less resource wasted(resource is allocated but not > used), > It's not always easy to predict skew situations, when happen, it make sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80]
[jira] [Commented] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964471#comment-15964471 ] Josh Bacon commented on SPARK-16784: [~tscholak] I have not created a follow up issue for this. For your situation I'd suggest trying --driver-java-options='..' instead of --conf='spark.driver.extraJavaOptions=...' because the latter is applied after driver jvm actually starts (too late for log4j). My use-case I abandoned attempting to configured log4j for executors, but was able to work with driver/application logs in both cluster and client mode (standalone) via baking my log4j.properties files into my apps Uber jar resources. I think the need of this issue is a new feature for distributing files/log4j.properties in the cluster before the actual Spark Driver starts which I'd imagine is not a pressing enough to warrant actual development at the moment. > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0, 2.1.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
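The workaround Josh Bacon suggests can be sketched as a spark-submit invocation (paths, class name, and jar name below are placeholders): pass the JVM option directly so log4j is configured before the driver JVM starts, rather than via {{spark.driver.extraJavaOptions}}, which is applied too late.

```shell
# Hedged example; file paths and names are placeholders.
spark-submit \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" \
  --class com.example.MyApp \
  my-app-uber.jar
```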
[jira] [Commented] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964463#comment-15964463 ] Sergei Lebedev commented on SPARK-20284: Yes, Scala is a bad citizen in the JVM land and comes w/o any support for try-with-resources. IIUC scala-arm would manage just fine without Closeable because it uses structural types. However, I think there is no reason not to implement Closeable/AutoCloseable even if Spark/Scala code does not need this. > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Priority: Trivial > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
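The benefit of extending {{Closeable}} is easiest to see from Java: try-with-resources only accepts types that implement {{AutoCloseable}} (which {{Closeable}} extends). The stream type below is a stand-in for illustration, not Spark's actual {{SerializationStream}}:

```java
import java.io.Closeable;

// Stand-in for a serialization stream; once it implements Closeable,
// try-with-resources closes it automatically, even if the body throws.
class FakeSerializationStream implements Closeable {
    boolean closed = false;

    void writeObject(Object o) {
        // serialize o (omitted)
    }

    @Override
    public void close() {
        closed = true;
    }
}

class TryWithResourcesDemo {
    static FakeSerializationStream useAndClose() {
        FakeSerializationStream s = new FakeSerializationStream();
        try (FakeSerializationStream stream = s) {
            stream.writeObject("payload");
        } // close() runs here automatically
        return s;
    }
}
```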
[jira] [Comment Edited] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964438#comment-15964438 ] Björn-Elmar Macek edited comment on SPARK-11788 at 4/11/17 2:39 PM: I have an issue which maybe related and occurs in 2.1.0. I created a table which contains a Timestamp column. An describe results in: user_id int null userCreatedAt timestamp null country_iso string null When i query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values as it should be due to the corresponding expression being in the where clause. When i execute the same query on the same table in which i replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. This is correct, since the year of those timestamp is not 2017. But i would expect them to not be included in the result. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue? was (Author: macek): I have an issue which maybe related and occurs in 2.1.0. I created a table which contains a Timestamp column. An describe results in: user_id int null userCreatedAt timestamp null country_iso string null When i query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values as it should be due to the corresponding expression being in the where clause. When i execute the same query on the same table in which i replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. This is correct, since the year of those timestamp is not 2017. 
But i would expect them to not be included in the result. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue and will it be fixed? > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp >Assignee: Huaxin Gao > Fix For: 1.5.3, 1.6.0 > > > I have a MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue) which should surround timestamps/dates with quotes (only does > it for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...) > val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > Fallback is to filter client-side which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. > Thanks -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
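The quoting fix described in the issue above can be sketched stand-alone: when compiling a pushed-down filter to SQL text, timestamp and date literals need single quotes just like strings. This mirrors the idea behind {{JDBCRDD.compileValue}} but is a simplified hypothetical version, not Spark's actual code:

```java
import java.sql.Date;
import java.sql.Timestamp;

// Sketch of the proposed fix: quote temporal literals as well as strings
// when rendering a filter value into a SQL WHERE clause.
class FilterCompiler {
    static String compileValue(Object value) {
        if (value instanceof String) {
            // escape embedded quotes, then wrap
            return "'" + ((String) value).replace("'", "''") + "'";
        } else if (value instanceof Timestamp || value instanceof Date) {
            return "'" + value + "'"; // e.g. '2015-01-01 00:00:00.0'
        } else {
            return String.valueOf(value); // numbers etc. stay unquoted
        }
    }
}
```

With this change, the generated predicate becomes {{TIMESTAMP_COLUMN >= '2015-01-01 00:00:00.0'}} instead of the unquoted form that SQL Server rejects.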
[jira] [Comment Edited] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964438#comment-15964438 ] Björn-Elmar Macek edited comment on SPARK-11788 at 4/11/17 2:34 PM: I have an issue which maybe related and occurs in 2.1.0. I created a table which contains a Timestamp column. An describe results in: user_id int null userCreatedAt timestamp null country_iso string null When i query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values as it should be due to the corresponding expression being in the where clause. When i execute the same query on the same table in which i replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. This is correct, since the year of those timestamp is not 2017. But i would expect them to not be included in the result. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue and will it be fixed? was (Author: macek): I have an issue which maybe related and occurs in 2.1.0. I created a table which contains a Timestamp column. An describe results in: user_id int null userCreatedAt timestamp null country_iso string null When i query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values as it should be due to the expression being in the where clause. When i execute the same query on the same table in which i replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. This is correct, since the year of those timestamp is not 2017. 
But I would expect them not to be included in the result. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue, and will it be fixed? > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp >Assignee: Huaxin Gao > Fix For: 1.5.3, 1.6.0 > > > I have a MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue) which should surround timestamps/dates with quotes (only does > it for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...) > val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > Fallback is to filter client-side which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. > Thanks -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964438#comment-15964438 ] Björn-Elmar Macek edited comment on SPARK-11788 at 4/11/17 2:28 PM: I have an issue which may be related; it occurs in 2.1.0. I created a table which contains a Timestamp column. A describe results in: user_id int null userCreatedAt timestamp null country_iso string null When I query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values, as it should, due to the expression being in the where clause. When I execute the same query on the same table, in which I replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. This is correct, since the year of those timestamps is not 2017, but I would expect them not to be included in the result. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue, and will it be fixed? was (Author: macek): I have an issue which may be related; it occurs in 2.1.0. I created a table which contains a Timestamp column. A describe results in: user_id int null userCreatedAt timestamp null country_iso string null When I query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values, as it should, due to the expression being in the where clause. When I execute the same query on the same table, in which I replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. Also, date_format(userCreatedAt, "") returns the correct year.
Can anybody reproduce this issue and will it be fixed? > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp >Assignee: Huaxin Gao > Fix For: 1.5.3, 1.6.0 > > > I have a MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue) which should surround timestamps/dates with quotes (only does > it for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...) > val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > Fallback is to filter client-side which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. > Thanks -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964438#comment-15964438 ] Björn-Elmar Macek commented on SPARK-11788: --- I have an issue which may be related; it occurs in 2.1.0. I created a table which contains a Timestamp column. A describe results in: user_id int null userCreatedAt timestamp null country_iso string null When I query as follows ... select date_format(userCreatedAt, ""), userCreatedAt, date_format(userCreatedAt, "") = "2017" from result where date_format(userCreatedAt, "") = "2017" order by userCreatedAt ... the third column contains only true values, as it should, due to the expression being in the where clause. When I execute the same query on the same table, in which I replaced userCreatedAt by a java.sql.Timestamp created from the userCreatedAt column, the result contains lots of false values. Also, date_format(userCreatedAt, "") returns the correct year. Can anybody reproduce this issue, and will it be fixed? > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp >Assignee: Huaxin Gao > Fix For: 1.5.3, 1.6.0 > > > I have a MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue) which should surround timestamps/dates with quotes (only does > it for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...)
> val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > Fallback is to filter client-side which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. > Thanks -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
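The quoting defect in compileValue described above can be illustrated outside Spark. The following is a minimal, hypothetical Python sketch of a literal compiler for pushed-down filters (not Spark's actual Scala code): string literals are already quoted, and the fix is to quote timestamp/date literals the same way, since leaving them bare produces invalid SQL such as TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0.

```python
from datetime import datetime, date

def compile_value(value):
    """Compile a filter literal into a SQL fragment.

    Hypothetical analog of JDBCRDD.compileValue: strings are quoted
    (with embedded quotes doubled), and the fix discussed above is to
    quote datetime/date literals the same way instead of letting them
    fall through unquoted.
    """
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, (datetime, date)):
        # Without this branch the literal would be emitted bare,
        # reproducing the reported SQLServerException.
        return "'" + str(value) + "'"
    return str(value)

clause = "TIMESTAMP_COLUMN >= " + compile_value(datetime(2015, 1, 1))
print(clause)  # TIMESTAMP_COLUMN >= '2015-01-01 00:00:00'
```

With the datetime branch removed, the same call would yield the unquoted, invalid form shown in the report.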
[jira] [Commented] (SPARK-20244) Incorrect input size in UI with pyspark
[ https://issues.apache.org/jira/browse/SPARK-20244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964422#comment-15964422 ] Saisai Shao commented on SPARK-20244: - This is actually not a UI problem; it is a FileSystem thread-local statistics problem. PythonRDD creates another thread to read the data, so the bytes-read figure obtained from a different thread is wrong. There is no problem when using spark-shell, since everything is processed in one thread. This is a general problem whenever the child RDD's computation creates another thread to consume the parent RDD's (HadoopRDD's) iterator. I tried several different ways to handle this problem, but still have some small issues. The multi-threaded processing inside the RDD makes the fix quite complex. > Incorrect input size in UI with pyspark > --- > > Key: SPARK-20244 > URL: https://issues.apache.org/jira/browse/SPARK-20244 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.1.0 >Reporter: Artur Sukhenko >Priority: Minor > Attachments: pyspark_incorrect_inputsize.png, > sparkshell_correct_inputsize.png > > > In Spark UI (Details for Stage) Input Size is 64.0 KB when running in > PySparkShell. > Also it is incorrect in Tasks table: > 64.0 KB / 132120575 in pyspark > 252.0 MB / 132120575 in spark-shell > I will attach screenshots. > Reproduce steps: > Run this to generate big file (press Ctrl+C after 5-6 seconds) > $ yes > /tmp/yes.txt > $ hadoop fs -copyFromLocal /tmp/yes.txt /tmp/ > $ ./bin/pyspark > {code} > Python 2.7.5 (default, Nov 6 2016, 00:28:07) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).
> Welcome to Spark [ASCII art banner] version 2.1.0 > Using Python version 2.7.5 (default, Nov 6 2016 00:28:07) > SparkSession available as 'spark'.{code} > >>> a = sc.textFile("/tmp/yes.txt") > >>> a.count() > Open Spark UI and check Stage 0. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
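The thread-local statistics failure mode described above can be sketched in plain Python (a hypothetical stand-in for Hadoop's FileSystem statistics, not Spark code): a counter kept in threading.local and updated by a separate reader thread, as PythonRDD does, is invisible to the task thread that reports the metrics.

```python
import threading

# Hypothetical stand-in for FileSystem's per-thread read statistics.
stats = threading.local()

def read_some_data(num_bytes):
    # Runs in the reader thread: updates *that thread's* copy of the counter.
    stats.bytes_read = getattr(stats, "bytes_read", 0) + num_bytes

def task_using_reader_thread():
    # The task spawns a separate thread to do the actual reading,
    # as PythonRDD does for the parent HadoopRDD's iterator.
    reader = threading.Thread(target=read_some_data, args=(252 * 1024 * 1024,))
    reader.start()
    reader.join()
    # The task thread's own thread-local counter was never touched,
    # so the UI would report (near) zero input size.
    return getattr(stats, "bytes_read", 0)

print(task_using_reader_thread())  # 0, even though 252 MB were "read"
```

In spark-shell both the read and the metrics report happen on the same thread, so the counter is correct there.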
[jira] [Updated] (SPARK-20295) when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse
[ https://issues.apache.org/jira/browse/SPARK-20295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruhui Wang updated SPARK-20295: --- Description: When spark.sql.exchange.reuse is enabled and a query with a self join (such as tpcds-q95) is run, the physical plan may randomly become the following:

WholeStageCodegen
: +- Project [id#0L]
: +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
::- Project [id#0L]
:: +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
:: :- Range 0, 1, 4, 1024, [id#0L]
:: +- INPUT
:+- INPUT
:- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
: +- WholeStageCodegen
: : +- Range 0, 1, 4, 1024, [id#1L]
+- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))

If spark.sql.adaptive.enabled = true, the call stack is: ShuffleExchange#doExecute --> postShuffleRDD function --> doEstimationIfNecessary. In this function, assert(exchanges.length == numExchanges) fails, as the left side has only one element while the right side equals 2. Is this a bug in the interaction of spark.sql.adaptive.enabled and exchange reuse? was: When spark.sql.exchange.reuse is enabled and a query with a self join (such as tpcds-q95) is run, the physical plan may randomly become the following:

WholeStageCodegen
: +- Project [id#0L]
: +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
::- Project [id#0L]
:: +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
:: :- Range 0, 1, 4, 1024, [id#0L]
:: +- INPUT
:+- INPUT
:- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
: +- WholeStageCodegen
: : +- Range 0, 1, 4, 1024, [id#1L]
+- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))

If spark.sql.adaptive.enabled = true, the call stack is: ShuffleExchange#doExecute --> postShuffleRDD function --> doEstimationIfNecessary.
In this function, assert(exchanges.length == numExchanges) fails, as the left side has only one element while the right side equals 2. Is this a bug in the interaction of spark.sql.adaptive.enabled and exchange reuse? > when spark.sql.adaptive.enabled is enabled, have conflict with Exchange Resue > -- > > Key: SPARK-20295 > URL: https://issues.apache.org/jira/browse/SPARK-20295 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 2.1.0 >Reporter: Ruhui Wang > > when spark.sql.exchange.reuse is opened, then run a query with self join(such > as tpcds-q95), the physical plan will become below randomly: > WholeStageCodegen > : +- Project [id#0L] > : +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None > ::- Project [id#0L] > :: +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None > :: :- Range 0, 1, 4, 1024, [id#0L] > :: +- INPUT > :+- INPUT > :- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L)) > : +- WholeStageCodegen > : : +- Range 0, 1, 4, 1024, [id#1L] > +- ReusedExchange [id#2L], BroadcastExchange > HashedRelationBroadcastMode(true,List(id#1L),List(id#1L)) > If spark.sql.adaptive.enabled = true, the code stack is : > ShuffleExchange#doExecute --> postShuffleRDD function --> > doEstimationIfNecessary . In this function, > assert(exchanges.length == numExchanges) will be error, as left side has only > one element, but right is equal to 2. > If this is a bug of spark.sql.adaptive.enabled and exchange resue? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20295) when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse
Ruhui Wang created SPARK-20295: -- Summary: when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse Key: SPARK-20295 URL: https://issues.apache.org/jira/browse/SPARK-20295 Project: Spark Issue Type: Bug Components: Shuffle, SQL Affects Versions: 2.1.0 Reporter: Ruhui Wang

When spark.sql.exchange.reuse is enabled and a query with a self join (such as tpcds-q95) is run, the physical plan may randomly become the following:

WholeStageCodegen
: +- Project [id#0L]
: +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
::- Project [id#0L]
:: +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
:: :- Range 0, 1, 4, 1024, [id#0L]
:: +- INPUT
:+- INPUT
:- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
: +- WholeStageCodegen
: : +- Range 0, 1, 4, 1024, [id#1L]
+- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))

If spark.sql.adaptive.enabled = true, the call stack is: ShuffleExchange#doExecute --> postShuffleRDD function --> doEstimationIfNecessary. In this function, assert(exchanges.length == numExchanges) fails, as the left side has only one element while the right side equals 2. Is this a bug in the interaction of spark.sql.adaptive.enabled and exchange reuse? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] João Pedro Jericó updated SPARK-20294: -- Description: Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible, for example, if one has a small RDD whose schema needs to be inferred from more than one Row. For example: ```python small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() ``` This will fail with high probability, because sampling the small_rdd with the .sample method will return an empty RDD most of the time. However, this is not the desired result, because we are able to sample at least 1% of the RDD. This is probably a problem with the other Spark APIs as well, but I don't have the knowledge to look at the source code for the other languages. was: Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible, for example, if one has a small RDD whose schema needs to be inferred from more than one Row. For example: ```python small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() ``` This will fail with high probability, because sampling the small_rdd with the .sample method will return an empty RDD most of the time. However, this is not the desired result, because we are able to sample at least 1% of the RDD.
> _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > ```python > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > ``` > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. > This is probably a problem with the other Spark APIs however I don't have the > knowledge to look at the source code for other languages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] João Pedro Jericó updated SPARK-20294: -- Description: Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example: ```python small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() ``` This will fail with high probability because when sampling the small_rdd with the .sample method it will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD. was: Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example: ```python small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() ``` This will fail with high probability because when sampling the small_rdd with the .sample method will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD. 
> _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > ```python > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > ``` > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
João Pedro Jericó created SPARK-20294: - Summary: _inferSchema for RDDs fails if sample returns empty RDD Key: SPARK-20294 URL: https://issues.apache.org/jira/browse/SPARK-20294 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.0 Reporter: João Pedro Jericó Priority: Minor Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible, for example, if one has a small RDD whose schema needs to be inferred from more than one Row. For example: ```python small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() ``` This will fail with high probability, because sampling the small_rdd with the .sample method will return an empty RDD most of the time. However, this is not the desired result, because we are able to sample at least 1% of the RDD. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
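The "high probability" claim above is easy to quantify: RDD.sample with a fraction p keeps each row independently with probability p (Bernoulli sampling), so an n-row RDD yields an empty sample with probability (1 - p)^n. A plain-Python check for the two-row example in the report (no Spark needed):

```python
# Probability that Bernoulli sampling with the given fraction keeps
# none of the n rows, i.e. that .sample returns an empty RDD.
def prob_empty_sample(n_rows, fraction):
    return (1 - fraction) ** n_rows

# The 2-row RDD from the report, sampled at 1%:
p = prob_empty_sample(2, 0.01)
print(round(p, 4))  # 0.9801 -- _inferSchema sees an empty RDD ~98% of the time
```

This is why the reported toDF(samplingRatio=0.01) call fails almost every run: schema inference is handed an empty sample roughly 98% of the time.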
[jira] [Assigned] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference
[ https://issues.apache.org/jira/browse/SPARK-20175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20175: --- Assignee: Liang-Chi Hsieh > Exists should not be evaluated in Join operator and can be converted to > ScalarSubquery if no correlated reference > - > > Key: SPARK-20175 > URL: https://issues.apache.org/jira/browse/SPARK-20175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.2.0 > > > Similar to ListQuery, Exists should not be evaluated in Join operator too. > Otherwise, a query like following will fail: > sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR > l.a = r.c)") > For the Exists subquery without correlated reference, this patch converts it > to scalar subquery with a count Aggregate operator. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20175) Exists should not be evaluated in Join operator and can be converted to ScalarSubquery if no correlated reference
[ https://issues.apache.org/jira/browse/SPARK-20175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20175. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17491 [https://github.com/apache/spark/pull/17491] > Exists should not be evaluated in Join operator and can be converted to > ScalarSubquery if no correlated reference > - > > Key: SPARK-20175 > URL: https://issues.apache.org/jira/browse/SPARK-20175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh > Fix For: 2.2.0 > > > Similar to ListQuery, Exists should not be evaluated in Join operator too. > Otherwise, a query like following will fail: > sql("select * from l, r where l.a = r.c + 1 AND (exists (select * from r) OR > l.a = r.c)") > For the Exists subquery without correlated reference, this patch converts it > to scalar subquery with a count Aggregate operator. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20274) support compatible array element type in encoder
[ https://issues.apache.org/jira/browse/SPARK-20274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20274. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17587 [https://github.com/apache/spark/pull/17587] > support compatible array element type in encoder > > > Key: SPARK-20274 > URL: https://issues.apache.org/jira/browse/SPARK-20274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964263#comment-15964263 ] Sean Owen commented on SPARK-20284: --- It's a developer API, so I don't think callers will generally be using this. try-with-resources isn't a JVM feature; it's a Java language feature. For example, it doesn't do anything for Scala, AFAIK. > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Priority: Trivial > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964257#comment-15964257 ] Sergei Lebedev edited comment on SPARK-20284 at 4/11/17 11:57 AM: -- It makes the stream well-behaved for any JVM language, e.g. pure-Java or [Kotlin|https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.io/use.html]. was (Author: lebedev): It makes the stream well-behaved for any JVM user, e.g. pure-Java or [Kotlin|https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.io/use.html]. > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Priority: Trivial > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964257#comment-15964257 ] Sergei Lebedev commented on SPARK-20284: It makes the stream well-behaved for any JVM user, e.g. pure-Java or [Kotlin|https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.io/use.html]. > Make SerializationStream and DeserializationStream extend Closeable > --- > > Key: SPARK-20284 > URL: https://issues.apache.org/jira/browse/SPARK-20284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.1.0 >Reporter: Sergei Lebedev >Priority: Trivial > > Both {{SerializationStream}} and {{DeserializationStream}} implement > {{close}} but do not extend {{Closeable}}. As a result, these streams cannot > be used in try-with-resources. > Was this intentional or rather nobody ever needed that? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20293) In the 'jobs' or 'stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20293: Assignee: (was: Apache Spark) > In the page of 'jobs' or 'stages' of history server web ui,,click the 'Go' > button, query paging data, the page error > - > > Key: SPARK-20293 > URL: https://issues.apache.org/jira/browse/SPARK-20293 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte > Attachments: error1.png, error2.png, jobs.png, stages.png > > > In the page of 'jobs' or 'stages' of history server web ui, > Click on the 'Go' button, query paging data, the page error, function can not > be used. > The reasons are as follows: > '#' Was escaped by the browser as% 23. > & CompletedStage.desc = true% 23completed, the parameter value desc becomes = > true% 23, causing the page to report an error. The error is as follows: > HTTP ERROR 400 > Problem Access / history / app-20170411132432-0004 / stages /. Reason: > For input string: "true # completed" > Powered by Jetty: // > The amendments are as follows: > The URL of the accessed URL is escaped to ensure that the URL is not escaped > by the browser. > please see attachment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20293) In the 'jobs' or 'stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964184#comment-15964184 ] Apache Spark commented on SPARK-20293: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/17608 > In the page of 'jobs' or 'stages' of history server web ui,,click the 'Go' > button, query paging data, the page error > - > > Key: SPARK-20293 > URL: https://issues.apache.org/jira/browse/SPARK-20293 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte > Attachments: error1.png, error2.png, jobs.png, stages.png > > > In the page of 'jobs' or 'stages' of history server web ui, > Click on the 'Go' button, query paging data, the page error, function can not > be used. > The reasons are as follows: > '#' Was escaped by the browser as% 23. > & CompletedStage.desc = true% 23completed, the parameter value desc becomes = > true% 23, causing the page to report an error. The error is as follows: > HTTP ERROR 400 > Problem Access / history / app-20170411132432-0004 / stages /. Reason: > For input string: "true # completed" > Powered by Jetty: // > The amendments are as follows: > The URL of the accessed URL is escaped to ensure that the URL is not escaped > by the browser. > please see attachment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20293) In the 'Jobs' or 'Stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20293: Assignee: Apache Spark
[jira] [Updated] (SPARK-20293) In the 'Jobs' or 'Stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20293: --- Attachment: stages.png, jobs.png, error2.png, error1.png
[jira] [Updated] (SPARK-20293) In the 'Jobs' or 'Stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20293: --- Description: (updated; the revised description is quoted in full above) Summary: In the 'Jobs' or 'Stages' page of the history server web UI, clicking the 'Go' button to query paging data causes a page error (was: In history web UI, click the 'Go' button, paging query, the page error)