[jira] [Commented] (SPARK-18872) New test cases for EXISTS subquery
[ https://issues.apache.org/jira/browse/SPARK-18872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852674#comment-15852674 ] Apache Spark commented on SPARK-18872: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/16802 > New test cases for EXISTS subquery > -- > > Key: SPARK-18872 > URL: https://issues.apache.org/jira/browse/SPARK-18872 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Reporter: Nattavut Sutyanyong >Assignee: Dilip Biswal > Fix For: 2.2.0 > > > This JIRA is for submitting a PR for new EXISTS/NOT EXISTS subquery test > cases. It follows the same idea as the IN subquery test cases, which start with > simple patterns and then build more complex constructs on both the parent and > subquery sides. This batch of test cases is mostly, if not all, positive > test cases that do not return any syntax errors or unsupported functionality. > We make an effort to have test cases returning rows in the result set so that > they can indirectly detect incorrect-result problems. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
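The test pattern described in this issue (simple EXISTS/NOT EXISTS pairs whose results contain rows, so a correctness bug shows up as a wrong row count) can be illustrated with a minimal sketch using Python's sqlite3. The table and column names here are hypothetical, not taken from the Spark test suite.

```python
import sqlite3

# A simple parent/subquery pair of the kind the new test cases build on:
# EXISTS keeps parent rows with a matching child row, NOT EXISTS keeps the rest.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER, val TEXT);
    CREATE TABLE child  (pid INTEGER);
    INSERT INTO parent VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO child  VALUES (1), (3);
""")

exists_rows = conn.execute(
    "SELECT id FROM parent p WHERE EXISTS "
    "(SELECT 1 FROM child c WHERE c.pid = p.id) ORDER BY id"
).fetchall()

not_exists_rows = conn.execute(
    "SELECT id FROM parent p WHERE NOT EXISTS "
    "(SELECT 1 FROM child c WHERE c.pid = p.id) ORDER BY id"
).fetchall()

print(exists_rows)      # ids with a matching child: [(1,), (3,)]
print(not_exists_rows)  # ids without one: [(2,)]
```

Because both queries return rows, an incorrect-result bug in either rewrite would surface directly in the fetched lists, which is the point the issue description makes.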
[jira] [Assigned] (SPARK-19446) Remove unused findTightestCommonType in TypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-19446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-19446: --- Assignee: Hyukjin Kwon > Remove unused findTightestCommonType in TypeCoercion > > > Key: SPARK-19446 > URL: https://issues.apache.org/jira/browse/SPARK-19446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.2.0 > > > It seems the code below is no longer used. > {code} > /** >* Find the tightest common type of a set of types by continuously applying >* `findTightestCommonTypeOfTwo` on these types. >*/ > private def findTightestCommonType(types: Seq[DataType]): Option[DataType] = { > types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match { > case None => None > case Some(d) => findTightestCommonTypeOfTwo(d, c) > }) > } > {code}
[jira] [Resolved] (SPARK-19446) Remove unused findTightestCommonType in TypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-19446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19446. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16786 [https://github.com/apache/spark/pull/16786] > Remove unused findTightestCommonType in TypeCoercion > > > Key: SPARK-19446 > URL: https://issues.apache.org/jira/browse/SPARK-19446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Trivial > Fix For: 2.2.0 > > > It seems the code below is no longer used. > {code} > /** >* Find the tightest common type of a set of types by continuously applying >* `findTightestCommonTypeOfTwo` on these types. >*/ > private def findTightestCommonType(types: Seq[DataType]): Option[DataType] = { > types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match { > case None => None > case Some(d) => findTightestCommonTypeOfTwo(d, c) > }) > } > {code}
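The removed Scala helper folds a pairwise "tightest common type" function over a list of types, short-circuiting to None as soon as one pair has no common type. A minimal Python sketch of the same fold shape follows; the toy widening table stands in for Spark's actual `findTightestCommonTypeOfTwo` coercion rules, which are more elaborate.

```python
from functools import reduce

# Toy widening hierarchy standing in for Spark's type coercion table
# (an assumption for illustration only): "null" widens to anything,
# numeric types widen left-to-right.
WIDENING_ORDER = ["null", "int", "long", "double"]

def tightest_common_type_of_two(a, b):
    # Return the wider of the two types, or None if either is unknown
    # (mirroring the Option[DataType] result in the Scala code).
    if a not in WIDENING_ORDER or b not in WIDENING_ORDER:
        return None
    return max(a, b, key=WIDENING_ORDER.index)

def tightest_common_type(types):
    # Mirrors types.foldLeft[Option[DataType]](Some(NullType))(...):
    # once the accumulator is None, it stays None.
    return reduce(
        lambda acc, t: None if acc is None else tightest_common_type_of_two(acc, t),
        types,
        "null",
    )

print(tightest_common_type(["int", "long", "double"]))  # "double"
print(tightest_common_type(["int", "string"]))          # None
```

The fold structure is the part the Scala snippet defines; everything else here is a simplified stand-in.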
[jira] [Commented] (SPARK-19428) Ability to select first row of groupby
[ https://issues.apache.org/jira/browse/SPARK-19428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852582#comment-15852582 ] Takeshi Yamamuro commented on SPARK-19428: -- Thanks for the explanation! "df.select($"group").distinct()" is not enough for your case? > Ability to select first row of groupby > -- > > Key: SPARK-19428 > URL: https://issues.apache.org/jira/browse/SPARK-19428 > Project: Spark > Issue Type: Brainstorming > Components: SQL >Affects Versions: 2.1.0 >Reporter: Luke Miner >Priority: Minor > > It would be nice to be able to select the first row from {{GroupedData}}. > Pandas has something like this: > {{df.groupby('group').first()}} > It's especially handy if you can order the group as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
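The semantics the report asks for, pandas' `df.groupby('group').first()` with an ordering applied within each group, can be sketched in plain Python without Spark. The field names below are hypothetical.

```python
# "First row of each group, after ordering": sort by the ordering column,
# then keep the first row seen for each group key.
rows = [
    {"group": "a", "ts": 3, "val": "x"},
    {"group": "a", "ts": 1, "val": "y"},
    {"group": "b", "ts": 2, "val": "z"},
]

def first_row_per_group(rows, key, order):
    first = {}
    for row in sorted(rows, key=lambda r: r[order]):
        # setdefault only stores the first row encountered for each key.
        first.setdefault(row[key], row)
    return sorted(first.values(), key=lambda r: r[key])

print(first_row_per_group(rows, key="group", order="ts"))
# [{'group': 'a', 'ts': 1, 'val': 'y'}, {'group': 'b', 'ts': 2, 'val': 'z'}]
```

Note this keeps whole rows, which is why the suggested `df.select($"group").distinct()` is weaker: distinct returns only the group keys, not the other columns of the first row.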
[jira] [Commented] (SPARK-19353) Support binary I/O in PipedRDD
[ https://issues.apache.org/jira/browse/SPARK-19353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852498#comment-15852498 ] Sergei Lebedev commented on SPARK-19353: For reference: we have a fully backward-compatible [implementation|https://github.com/criteo-forks/spark/pull/26] of binary PipedRDD in our GitHub fork. > Support binary I/O in PipedRDD > -- > > Key: SPARK-19353 > URL: https://issues.apache.org/jira/browse/SPARK-19353 > Project: Spark > Issue Type: Improvement >Reporter: Sergei Lebedev >Priority: Minor > > The current design of RDD.pipe is very restrictive. > It is line-based: each element of the input RDD [gets > serialized|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala#L143] > into one or more lines. Similarly for the output of the child process, one > line corresponds to a single element of the output RDD. > It allows customizing the output format via {{printRDDElement}}, but not the > input format. > It is not designed for extensibility. The only way to get a "BinaryPipedRDD" > is to copy/paste most of it and change the relevant parts. > These limitations have been discussed on > [SO|http://stackoverflow.com/questions/27986830/how-to-pipe-binary-data-in-apache-spark] > and the mailing list, but alas no issue has been created. > A possible solution to at least the first two limitations is to factor out > the format into a separate object (or objects). For instance, {{InputWriter}} > and {{OutputReader}}, following the Hadoop streaming API. > {code} > trait InputWriter[T] { > def write(os: OutputStream, elem: T) > } > trait OutputReader[T] { > def read(is: InputStream): T > } > {code} > The default configuration would be to write and read in line-based format, > but users will also be able to swap in the appropriate implementations.
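The proposed InputWriter/OutputReader split can be sketched outside Spark with a small pipe helper. The class shapes mirror the Scala traits in the proposal; the concrete line-based implementations and the use of `cat` as the child process (a Unix assumption) are only illustrations.

```python
import subprocess

# The pipe format lives in two small objects instead of being hard-coded
# as line-based I/O; a binary variant would just swap in different
# writer/reader implementations.
class LineInputWriter:
    def write(self, stream, elem):
        stream.write((elem + "\n").encode("utf-8"))

class LineOutputReader:
    def read_all(self, stream):
        return [line.decode("utf-8").rstrip("\n") for line in stream]

def pipe(elements, command, writer, reader):
    # Feed every element to the child process, then read back its output.
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    for elem in elements:
        writer.write(proc.stdin, elem)
    proc.stdin.close()
    out = reader.read_all(proc.stdout)
    proc.wait()
    return out

result = pipe(["a", "b", "c"], ["cat"], LineInputWriter(), LineOutputReader())
print(result)  # ["a", "b", "c"]
```

The point, as in the issue, is that only the writer/reader pair would need to change to support binary framing; the piping machinery stays the same.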
[jira] [Assigned] (SPARK-13619) Jobs page UI shows wrong number of failed tasks
[ https://issues.apache.org/jira/browse/SPARK-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13619: Assignee: Apache Spark > Jobs page UI shows wrong number of failed tasks > --- > > Key: SPARK-13619 > URL: https://issues.apache.org/jira/browse/SPARK-13619 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Devaraj K >Assignee: Apache Spark >Priority: Minor > > In the Master and History Server UIs, the Jobs page shows the wrong number of failed > tasks. > http://X.X.X.X:8080/history/app-20160303024135-0001/jobs/ > h3. Completed Jobs (1) > ||Job Id||Description|| Submitted|| Duration|| Stages: > Succeeded/Total|| Tasks (for all stages): Succeeded/Total|| > |0 | saveAsTextFile at PipeLineTest.java:52| 2016/03/03 02:41:36 |16 s | > 2/2 | 100/100 (2 failed)| > \\ > \\ > When we go to the Job details page, we can see a different number of failed > tasks, and it is the correct number based on the actual failed tasks. > http://x.x.x.x:8080/history/app-20160303024135-0001/jobs/job/?id=0 > h3. Completed Stages (2) > ||Stage Id|| Description|| Submitted|| Duration|| Tasks: > Succeeded/Total||Input|| Output||Shuffle Read|| Shuffle > Write|| > |1| saveAsTextFile at PipeLineTest.java:52 +details|2016/03/03 02:41:51| > 1 s|50/50 (6 failed)| |7.6 KB|371.0 KB| | > |0| mapToPair at PipeLineTest.java:29 +details|2016/03/03 02:41:36| 15 s| > 50/50| 1521.7 MB| | | 371.0 KB|
[jira] [Commented] (SPARK-13619) Jobs page UI shows wrong number of failed tasks
[ https://issues.apache.org/jira/browse/SPARK-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852470#comment-15852470 ] Apache Spark commented on SPARK-13619: -- User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/16801 > Jobs page UI shows wrong number of failed tasks > --- > > Key: SPARK-13619 > URL: https://issues.apache.org/jira/browse/SPARK-13619 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Devaraj K >Priority: Minor > > In Master and History Server UI's, Jobs page shows the wrong number of failed > tasks. > http://X.X.X.X:8080/history/app-20160303024135-0001/jobs/ > h3. Completed Jobs (1) > ||Job Id||Description|| Submitted|| Duration|| Stages: > Succeeded/Total|| Tasks (for all stages): Succeeded/Total|| > |0 | saveAsTextFile at PipeLineTest.java:52| 2016/03/03 02:41:36 |16 s | > 2/2 | 100/100 (2 failed)| > \\ > \\ > When we go to the Job details page, we can see different number for failed > tasks and It is the correct number based on the failed tasks. > http://x.x.x.x:8080/history/app-20160303024135-0001/jobs/job/?id=0 > h3. Completed Stages (2) > ||Stage Id|| Description|| Submitted|| Duration|| Tasks: > Succeeded/Total||Input|| Output||Shuffle Read|| Shuffle > Write|| > |1| saveAsTextFile at PipeLineTest.java:52 +details|2016/03/03 02:41:51| > 1 s|50/50 (6 failed)| |7.6 KB|371.0 KB| | > |0| mapToPair at PipeLineTest.java:29 +details|2016/03/03 02:41:36| 15 s| > 50/50| 1521.7 MB| | | 371.0 KB| -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13619) Jobs page UI shows wrong number of failed tasks
[ https://issues.apache.org/jira/browse/SPARK-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13619: Assignee: (was: Apache Spark) > Jobs page UI shows wrong number of failed tasks > --- > > Key: SPARK-13619 > URL: https://issues.apache.org/jira/browse/SPARK-13619 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Devaraj K >Priority: Minor > > In Master and History Server UI's, Jobs page shows the wrong number of failed > tasks. > http://X.X.X.X:8080/history/app-20160303024135-0001/jobs/ > h3. Completed Jobs (1) > ||Job Id||Description|| Submitted|| Duration|| Stages: > Succeeded/Total|| Tasks (for all stages): Succeeded/Total|| > |0 | saveAsTextFile at PipeLineTest.java:52| 2016/03/03 02:41:36 |16 s | > 2/2 | 100/100 (2 failed)| > \\ > \\ > When we go to the Job details page, we can see different number for failed > tasks and It is the correct number based on the failed tasks. > http://x.x.x.x:8080/history/app-20160303024135-0001/jobs/job/?id=0 > h3. Completed Stages (2) > ||Stage Id|| Description|| Submitted|| Duration|| Tasks: > Succeeded/Total||Input|| Output||Shuffle Read|| Shuffle > Write|| > |1| saveAsTextFile at PipeLineTest.java:52 +details|2016/03/03 02:41:51| > 1 s|50/50 (6 failed)| |7.6 KB|371.0 KB| | > |0| mapToPair at PipeLineTest.java:29 +details|2016/03/03 02:41:36| 15 s| > 50/50| 1521.7 MB| | | 371.0 KB| -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19456) Add LinearSVC R API
[ https://issues.apache.org/jira/browse/SPARK-19456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852450#comment-15852450 ] Apache Spark commented on SPARK-19456: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/16800 > Add LinearSVC R API > --- > > Key: SPARK-19456 > URL: https://issues.apache.org/jira/browse/SPARK-19456 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang > > Linear SVM classifier is newly added into ML and python API has been added. > This JIRA is to add R side API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19386) Bisecting k-means in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-19386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852449#comment-15852449 ] Apache Spark commented on SPARK-19386: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/16799 > Bisecting k-means in SparkR documentation > - > > Key: SPARK-19386 > URL: https://issues.apache.org/jira/browse/SPARK-19386 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Krishna Kalyan > Fix For: 2.2.0 > > > we need updates to programming guide, example and vignettes -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19456) Add LinearSVC R API
[ https://issues.apache.org/jira/browse/SPARK-19456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19456: Assignee: (was: Apache Spark) > Add LinearSVC R API > --- > > Key: SPARK-19456 > URL: https://issues.apache.org/jira/browse/SPARK-19456 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang > > Linear SVM classifier is newly added into ML and python API has been added. > This JIRA is to add R side API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19456) Add LinearSVC R API
[ https://issues.apache.org/jira/browse/SPARK-19456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19456: Assignee: Apache Spark > Add LinearSVC R API > --- > > Key: SPARK-19456 > URL: https://issues.apache.org/jira/browse/SPARK-19456 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang >Assignee: Apache Spark > > Linear SVM classifier is newly added into ML and python API has been added. > This JIRA is to add R side API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19456) Add LinearSVC R API
Miao Wang created SPARK-19456: - Summary: Add LinearSVC R API Key: SPARK-19456 URL: https://issues.apache.org/jira/browse/SPARK-19456 Project: Spark Issue Type: New Feature Components: SparkR Affects Versions: 2.2.0 Reporter: Miao Wang The linear SVM classifier was newly added to ML and a Python API has been added. This JIRA is to add the R-side API.
[jira] [Commented] (SPARK-18873) New test cases for scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-18873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852439#comment-15852439 ] Apache Spark commented on SPARK-18873: -- User 'nsyca' has created a pull request for this issue: https://github.com/apache/spark/pull/16798 > New test cases for scalar subquery > -- > > Key: SPARK-18873 > URL: https://issues.apache.org/jira/browse/SPARK-18873 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Reporter: Nattavut Sutyanyong > > This JIRA is for submitting a PR for new test cases on scalar subquery. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
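The scalar-subquery shape these new test cases target (a subquery returning exactly one value, used as an expression in the parent query) can be illustrated with a minimal sqlite3 sketch; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 30), (3, 20);
""")

# The parenthesized subquery yields a single scalar (the average amount),
# which the parent query compares each row against.
above_avg = conn.execute(
    "SELECT id FROM orders "
    "WHERE amount > (SELECT AVG(amount) FROM orders) ORDER BY id"
).fetchall()
print(above_avg)  # the average is 20, so only id 2 qualifies
```

As with the EXISTS batch, a query that returns rows lets the test indirectly detect incorrect-result bugs rather than only syntax or support errors.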
[jira] [Updated] (SPARK-19418) Dataset generated java code fails to compile as java.lang.Long does not accept UTF8String in constructor
[ https://issues.apache.org/jira/browse/SPARK-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suresh Avadhanula updated SPARK-19418: -- Priority: Minor (was: Major) > Dataset generated java code fails to compile as java.lang.Long does not > accept UTF8String in constructor > > > Key: SPARK-19418 > URL: https://issues.apache.org/jira/browse/SPARK-19418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Suresh Avadhanula >Priority: Minor > Attachments: encodertest.zip > > > I have the following in Java spark driver. > DealerPerson module object is > {code:title=DealerPerson.java|borderStyle=solid} > public class DealerPerson > { > Long schemaOrgUnitId ; > List personList > } > {code} > I populate it using group by as follows. > {code} > Dataset dps = persondds.groupByKey(new MapFunctionLong>() { > @Override > public Long call(Person person) throws Exception { > return person.getSchemaOrgUnitId(); > } > }, Encoders.LONG()). > mapGroups(new MapGroupsFunction () > { > @Override > public DealerPerson call(Long dp, > java.util.Iterator iterator) throws Exception { > DealerPerson retDp = new DealerPerson(); > retDp.setSchemaOrgUnitId(dp); > ArrayList persons = new ArrayList(); > while (iterator.hasNext()) > persons.add(iterator.next()); > retDp.setPersons(persons); > return retDp; > } > }, Encoders.bean(DealerPerson.class)); > {code} > The generated code throws compiler exception since UTF8String is > java.lang.Long() > {noformat} > 7/01/31 20:32:28 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 5, > localhost, executor driver, partition 0, PROCESS_LOCAL, 6442 bytes) > 17/01/31 20:32:28 INFO Executor: Running task 0.0 in stage 5.0 (TID 5) > 17/01/31 20:32:28 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 58: No applicable constructor/method found for actual parameters > "org.apache.spark.unsafe.types.UTF8String"; 
candidates are: > "java.lang.Long(long)", "java.lang.Long(java.lang.String)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private InternalRow mutableRow; > /* 009 */ private com.xtime.spark.model.Person javaBean; > /* 010 */ private boolean resultIsNull; > /* 011 */ private UTF8String argValue; > /* 012 */ private boolean resultIsNull1; > /* 013 */ private UTF8String argValue1; > /* 014 */ private boolean resultIsNull2; > /* 015 */ private UTF8String argValue2; > /* 016 */ private boolean resultIsNull3; > /* 017 */ private UTF8String argValue3; > /* 018 */ private boolean resultIsNull4; > /* 019 */ private UTF8String argValue4; > /* 020 */ > /* 021 */ public SpecificSafeProjection(Object[] references) { > /* 022 */ this.references = references; > /* 023 */ mutableRow = (InternalRow) references[references.length - 1]; > /* 024 */ > /* 025 */ > /* 026 */ > /* 027 */ > /* 028 */ > /* 029 */ > /* 030 */ > /* 031 */ > /* 032 */ > /* 033 */ > /* 034 */ > /* 035 */ > /* 036 */ } > /* 037 */ > /* 038 */ public void initialize(int partitionIndex) { > /* 039 */ > /* 040 */ } > /* 041 */ > /* 042 */ > /* 043 */ private void apply_4(InternalRow i) { > /* 044 */ > /* 045 */ > /* 046 */ resultIsNull1 = false; > /* 047 */ if (!resultIsNull1) { > /* 048 */ > /* 049 */ boolean isNull21 = i.isNullAt(17); > /* 050 */ UTF8String value21 = isNull21 ? null : (i.getUTF8String(17)); > /* 051 */ resultIsNull1 = isNull21; > /* 052 */ argValue1 = value21; > /* 053 */ } > /* 054 */ > /* 055 */ > /* 056 */ final java.lang.Long value20 = resultIsNull1 ? 
null : new > java.lang.Long(argValue1); > /* 057 */ javaBean.setSchemaOrgUnitId(value20); > /* 058 */ > /* 059 */ > /* 060 */ resultIsNull2 = false; > /* 061 */ if (!resultIsNull2) { > /* 062 */ > /* 063 */ boolean isNull23 = i.isNullAt(0); > /* 064 */ UTF8String value23 = isNull23 ? null : (i.getUTF8String(0)); > /* 065 */ resultIsNull2 = isNull23; > /* 066 */ argValue2 = value23; > /* 067 */ } > /*
[jira] [Commented] (SPARK-19418) Dataset generated java code fails to compile as java.lang.Long does not accept UTF8String in constructor
[ https://issues.apache.org/jira/browse/SPARK-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852416#comment-15852416 ] Suresh Avadhanula commented on SPARK-19418: --- Figured out the "workaround". The solution is based on a [StackOverflow thread|http://stackoverflow.com/questions/41995843/spark-compileexception-in-dataset-groupbykey]. The following works: {code:title=Reading from CSV} Dataset personDs = sqlContext .read() .option("header", true) .option("inferSchema", true) .csv("person10k.csv") .as(Encoders.bean(Person.class)); personDs.printSchema(); personDs.show(10); {code} option("inferSchema", true) and Encoders.bean() seem counterintuitive. > Dataset generated java code fails to compile as java.lang.Long does not > accept UTF8String in constructor > > > Key: SPARK-19418 > URL: https://issues.apache.org/jira/browse/SPARK-19418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Suresh Avadhanula > Attachments: encodertest.zip > > > I have the following in Java spark driver. > DealerPerson module object is > {code:title=DealerPerson.java|borderStyle=solid} > public class DealerPerson > { > Long schemaOrgUnitId ; > List personList > } > {code} > I populate it using group by as follows. > {code} > Dataset dps = persondds.groupByKey(new MapFunctionLong>() { > @Override > public Long call(Person person) throws Exception { > return person.getSchemaOrgUnitId(); > } > }, Encoders.LONG()).
> mapGroups(new MapGroupsFunction () > { > @Override > public DealerPerson call(Long dp, > java.util.Iterator iterator) throws Exception { > DealerPerson retDp = new DealerPerson(); > retDp.setSchemaOrgUnitId(dp); > ArrayList persons = new ArrayList(); > while (iterator.hasNext()) > persons.add(iterator.next()); > retDp.setPersons(persons); > return retDp; > } > }, Encoders.bean(DealerPerson.class)); > {code} > The generated code throws compiler exception since UTF8String is > java.lang.Long() > {noformat} > 7/01/31 20:32:28 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 5, > localhost, executor driver, partition 0, PROCESS_LOCAL, 6442 bytes) > 17/01/31 20:32:28 INFO Executor: Running task 0.0 in stage 5.0 (TID 5) > 17/01/31 20:32:28 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 58: No applicable constructor/method found for actual parameters > "org.apache.spark.unsafe.types.UTF8String"; candidates are: > "java.lang.Long(long)", "java.lang.Long(java.lang.String)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private InternalRow mutableRow; > /* 009 */ private com.xtime.spark.model.Person javaBean; > /* 010 */ private boolean resultIsNull; > /* 011 */ private UTF8String argValue; > /* 012 */ private boolean resultIsNull1; > /* 013 */ private UTF8String argValue1; > /* 014 */ private boolean resultIsNull2; > /* 015 */ private UTF8String argValue2; > /* 016 */ private boolean resultIsNull3; > /* 017 */ private UTF8String argValue3; > /* 018 */ private boolean resultIsNull4; > /* 019 */ private UTF8String argValue4; > /* 020 */ > /* 021 */ public SpecificSafeProjection(Object[] references) { 
> /* 022 */ this.references = references; > /* 023 */ mutableRow = (InternalRow) references[references.length - 1]; > /* 024 */ > /* 025 */ > /* 026 */ > /* 027 */ > /* 028 */ > /* 029 */ > /* 030 */ > /* 031 */ > /* 032 */ > /* 033 */ > /* 034 */ > /* 035 */ > /* 036 */ } > /* 037 */ > /* 038 */ public void initialize(int partitionIndex) { > /* 039 */ > /* 040 */ } > /* 041 */ > /* 042 */ > /* 043 */ private void apply_4(InternalRow i) { > /* 044 */ > /* 045 */ > /* 046 */ resultIsNull1 = false; > /* 047 */ if (!resultIsNull1) { > /* 048 */ > /* 049 */ boolean isNull21 = i.isNullAt(17); > /* 050 */ UTF8String value21 = isNull21 ? null : (i.getUTF8String(17)); > /* 051 */ resultIsNull1 = isNull21; > /* 052 */
[jira] [Commented] (SPARK-19455) Add option for case-insensitive Parquet field resolution
[ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852242#comment-15852242 ] Apache Spark commented on SPARK-19455: -- User 'budde' has created a pull request for this issue: https://github.com/apache/spark/pull/16797 > Add option for case-insensitive Parquet field resolution > > > Key: SPARK-19455 > URL: https://issues.apache.org/jira/browse/SPARK-19455 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde > > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization as the underlying file status no longer needs to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > This change initially included a [patch to > ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284] > that attempted to remedy this conflict by using a case-insensitive fallback > mapping when resolving field names during the schema clipping step. > [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed > this patch after > [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support > for embedding a case-sensitive schema as a Hive Metastore table property.
> AFAIK the assumption here was that the data schema obtained from the > Metastore table property will be case-sensitive and should match the Parquet > schema exactly. > The problem arises when dealing with Parquet-backed tables for which this > schema has not been embedded as a table attribute and for which the > underlying files contain case-sensitive field names. This will happen for any > Hive table that was not created by Spark or was created by a version prior to > 2.1.0. We've seen Spark SQL return no results for any query containing a > case-sensitive field name for such tables. > The change we're proposing is to introduce a configuration parameter that > will re-enable case-insensitive field name resolution in ParquetReadSupport. > This option will also disable filter push-down for Parquet, as the filter > predicate constructed by Spark SQL contains the case-insensitive field names, > for which Parquet will return 0 records when filtering against a > case-sensitive column name. I was hoping to find a way to construct the > filter on-the-fly in ParquetReadSupport but Parquet doesn't propagate the > Configuration object passed to this class to the underlying > InternalParquetRecordReader class.
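The case-insensitive fallback mapping described in this issue can be sketched as a small resolution function: try an exact (case-sensitive) match against the Parquet field names first, then fall back to a case-insensitive lookup, giving up when the fallback is ambiguous. The function name and signature are hypothetical, not Spark's actual API.

```python
def resolve_field(requested, parquet_fields):
    # Exact, case-sensitive match wins (the behaviour without the option).
    if requested in parquet_fields:
        return requested
    # Case-insensitive fallback, as the proposed option would enable.
    matches = [f for f in parquet_fields if f.lower() == requested.lower()]
    if len(matches) == 1:
        return matches[0]
    # Zero matches, or an ambiguous pair like "userId"/"userid": unresolved.
    return None

parquet_fields = ["userId", "eventTime", "payload"]
print(resolve_field("userid", parquet_fields))   # "userId"
print(resolve_field("payload", parquet_fields))  # "payload"
print(resolve_field("missing", parquet_fields))  # None
```

This also illustrates why the option must disable Parquet filter push-down: a pushed-down predicate built from the lower-cased metastore name ("userid") would match zero records in a file whose column is actually "userId".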
[jira] [Assigned] (SPARK-19455) Add option for case-insensitive Parquet field resolution
[ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19455: Assignee: (was: Apache Spark) > Add option for case-insensitive Parquet field resolution > > > Key: SPARK-19455 > URL: https://issues.apache.org/jira/browse/SPARK-19455 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde > > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization as the underlying file status no longer needs to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > This change initially included a [patch to > ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284] > that attempted to remedy this conflict by using a case-insensitive fallback > mapping when resolving field names during the schema clipping step. > [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed > this patch after > [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support > for embedding a case-sensitive schema as a Hive Metastore table property. > AFAIK the assumption here was that the data schema obtained from the > Metastore table property will be case-sensitive and should match the Parquet > schema exactly.
> The problem arises when dealing with Parquet-backed tables for which this > schema has not been embedded as a table attribute and for which the > underlying files contain case-sensitive field names. This will happen for any > Hive table that was not created by Spark or was created by a version prior to > 2.1.0. We've seen Spark SQL return no results for any query containing a > case-sensitive field name for such tables. > The change we're proposing is to introduce a configuration parameter that > will re-enable case-insensitive field name resolution in ParquetReadSupport. > This option will also disable filter push-down for Parquet, as the filter > predicate constructed by Spark SQL contains the case-insensitive field names, > for which Parquet will return 0 records when filtering against a > case-sensitive column name. I was hoping to find a way to construct the > filter on-the-fly in ParquetReadSupport but Parquet doesn't propagate the > Configuration object passed to this class to the underlying > InternalParquetRecordReader class.
[jira] [Assigned] (SPARK-19455) Add option for case-insensitive Parquet field resolution
[ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19455: Assignee: Apache Spark > Add option for case-insensitive Parquet field resolution > > > Key: SPARK-19455 > URL: https://issues.apache.org/jira/browse/SPARK-19455 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde >Assignee: Apache Spark > > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This is an > optimization, as the underlying file statuses no longer need to be resolved > until after the partition pruning step, significantly reducing the number of > files touched in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > This change initially included a [patch to > ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284] > that attempted to remedy this conflict by using a case-insensitive fallback > mapping when resolving field names during the schema clipping step. > [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed > this patch after > [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support > for embedding a case-sensitive schema as a Hive Metastore table property. > AFAIK the assumption here was that the data schema obtained from the > Metastore table property will be case sensitive and should match the Parquet > schema exactly. 
> The problem arises when dealing with Parquet-backed tables for which this > schema has not been embedded as a table attribute and for which the > underlying files contain case-sensitive field names. This will happen for any > Hive table that was not created by Spark or was created by a version prior to > 2.1.0. We've seen Spark SQL return no results for any query containing a > case-sensitive field name for such tables. > The change we're proposing is to introduce a configuration parameter that > will re-enable case-insensitive field name resolution in ParquetReadSupport. > This option will also disable filter push-down for Parquet, as the filter > predicate constructed by Spark SQL contains the case-insensitive field names, > for which Parquet will return 0 records when filtering against a > case-sensitive column name. I was hoping to find a way to construct the > filter on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the > Configuration object passed to this class to the underlying > InternalParquetRecordReader class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19455) Add option for case-insensitive Parquet field resolution
Adam Budde created SPARK-19455: -- Summary: Add option for case-insensitive Parquet field resolution Key: SPARK-19455 URL: https://issues.apache.org/jira/browse/SPARK-19455 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Adam Budde [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the schema inference from the HiveMetastoreCatalog class when converting a MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in favor of simply using the schema returned by the metastore. This is an optimization, as the underlying file statuses no longer need to be resolved until after the partition pruning step, significantly reducing the number of files touched in some cases. The downside is that the data schema used may no longer match the underlying file schema for case-sensitive formats such as Parquet. This change initially included a [patch to ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284] that attempted to remedy this conflict by using a case-insensitive fallback mapping when resolving field names during the schema clipping step. [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed this patch after [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support for embedding a case-sensitive schema as a Hive Metastore table property. AFAIK the assumption here was that the data schema obtained from the Metastore table property will be case sensitive and should match the Parquet schema exactly. The problem arises when dealing with Parquet-backed tables for which this schema has not been embedded as a table attribute and for which the underlying files contain case-sensitive field names. This will happen for any Hive table that was not created by Spark or was created by a version prior to 2.1.0. 
We've seen Spark SQL return no results for any query containing a case-sensitive field name for such tables. The change we're proposing is to introduce a configuration parameter that will re-enable case-insensitive field name resolution in ParquetReadSupport. This option will also disable filter push-down for Parquet, as the filter predicate constructed by Spark SQL contains the case-insensitive field names, for which Parquet will return 0 records when filtering against a case-sensitive column name. I was hoping to find a way to construct the filter on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the Configuration object passed to this class to the underlying InternalParquetRecordReader class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
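The case-insensitive fallback mapping described above can be illustrated with a small, self-contained sketch. This is plain Python, not the actual ParquetReadSupport code; the function and its behavior on ambiguous names are illustrative only:

```python
def resolve_field(requested, file_fields):
    """Resolve a requested (typically lower-cased) field name against the
    case-sensitive field names found in a Parquet file."""
    # An exact match wins, mirroring ordinary case-sensitive resolution.
    if requested in file_fields:
        return requested
    # Case-insensitive fallback: index the file's fields by lower-cased name.
    index = {}
    for name in file_fields:
        index.setdefault(name.lower(), []).append(name)
    matches = index.get(requested.lower(), [])
    if len(matches) > 1:
        # Two file columns that differ only by case cannot be disambiguated.
        raise ValueError("ambiguous field %r: %s" % (requested, matches))
    # Fall back to the requested name if the file has no such column at all.
    return matches[0] if matches else requested

print(resolve_field("userid", ["userId", "name"]))  # userId
```

This shows why the fallback only helps when the file schema has no case-only duplicates: Parquet itself is case-sensitive, so the resolver must pick exactly one concrete file column.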
[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter
[ https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852170#comment-15852170 ] Apache Spark commented on SPARK-10063: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/16796 > Remove DirectParquetOutputCommitter > --- > > Key: SPARK-10063 > URL: https://issues.apache.org/jira/browse/SPARK-10063 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Reynold Xin >Priority: Critical > Fix For: 2.0.0 > > > When we use DirectParquetOutputCommitter on S3 and speculation is enabled, > there is a chance that we can lose data. > Here is the code to reproduce the problem. > {code} > import org.apache.spark.sql.functions._ > val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: > Int, partitionId: Int, attemptNumber: Int) => { > if (partitionId == 0 && i == 5) { > if (attemptNumber > 0) { > Thread.sleep(15000) > throw new Exception("new exception") > } else { > Thread.sleep(1) > } > } > > i > }) > val df = sc.parallelize((1 to 100), 20).mapPartitions { iter => > val context = org.apache.spark.TaskContext.get() > val partitionId = context.partitionId > val attemptNumber = context.attemptNumber > iter.map(i => (i, partitionId, attemptNumber)) > }.toDF("i", "partitionId", "attemptNumber") > df > .select(failSpeculativeTask($"i", $"partitionId", > $"attemptNumber").as("i"), $"partitionId", $"attemptNumber") > .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter") > sqlContext.read.load("/home/yin/outputCommitter").count > // The result is 99 and 5 is missing from the output. > {code} > What happened is that the original task finishes first and uploads its output > file to S3, then the speculative task somehow fails. 
Because we have to call the > output stream's close method, which uploads data to S3, we actually upload > the partial result generated by the failed speculative task to S3, and this > file overwrites the correct file generated by the original task. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
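The race described above can be modeled without S3 at all: with a direct committer, both the original and the speculative attempt write straight to the same final key, so whichever attempt's close() lands last wins. This is a toy model of the failure mode, not the committer's real code:

```python
# Toy model of an object store: final output keys map to file contents.
store = {}

def direct_commit(key, data):
    # A "direct" committer has no staging directory or rename step:
    # closing the stream uploads whatever bytes the attempt produced,
    # directly to the final location.
    store[key] = data

# The original attempt finishes first with the complete partition.
direct_commit("part-00000", [1, 2, 3, 4, 5])

# The speculative attempt fails mid-write, but its close() still
# uploads a partial result to the very same key, clobbering the good file.
direct_commit("part-00000", [1, 2])

print(store["part-00000"])  # [1, 2] -- the partial result survived
```

A rename-based committer avoids this because each attempt writes to an attempt-specific staging path and only the winning attempt's file is renamed into place.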
[jira] [Assigned] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
[ https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-17161: --- Assignee: Bryan Cutler Affects Version/s: 2.2.0 > Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays > - > > Key: SPARK-17161 > URL: https://issues.apache.org/jira/browse/SPARK-17161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > Fix For: 2.2.0 > > > Often in Spark ML, there are classes that use a Scala Array in a constructor. > In order to add the same API to Python, a Java-friendly alternate > constructor needs to exist to be compatible with py4j when converting from a > list. This is because the current conversion in PySpark _py2java creates a > java.util.ArrayList, as shown in this error message > {noformat} > Py4JError: An error occurred while calling > None.org.apache.spark.ml.feature.CountVectorizerModel. Trace: > py4j.Py4JException: Constructor > org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) > does not exist > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179) > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196) > at py4j.Gateway.invoke(Gateway.java:235) > {noformat} > Creating an alternate constructor can be avoided by creating a py4j JavaArray > using {{new_array}}. This type is compatible with the Scala Array currently > used in classes like {{CountVectorizerModel}} and {{StringIndexerModel}}. > Most of the boilerplate Python code to do this can be put in a convenience > function inside of ml.JavaWrapper to give a clean way of constructing ML > objects without adding special constructors. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
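The {{new_array}} pattern can be sketched as a small helper. The stub gateway below only imitates the one slice of the py4j API the helper needs; real code would use the JVM gateway held by the SparkContext and a real Java class reference, so both `FakeGateway` and the helper name are illustrative:

```python
class FakeGateway:
    """Stub imitating py4j's gateway.new_array(java_class, length),
    which allocates a fixed-size, typed Java array."""
    def new_array(self, java_class, length):
        return [None] * length

def to_java_array(gateway, java_class, values):
    # Allocate a typed, fixed-size Java array and copy the Python list in,
    # instead of letting py4j auto-convert the list to a java.util.ArrayList
    # (which constructors taking Array[...] cannot accept).
    arr = gateway.new_array(java_class, len(values))
    for i, v in enumerate(values):
        arr[i] = v
    return arr

arr = to_java_array(FakeGateway(), "java.lang.String", ["a", "b", "c"])
print(arr)  # ['a', 'b', 'c']
```

The point of centralizing this in JavaWrapper is that each ML class can keep its existing Array-taking constructor while Python callers still pass ordinary lists.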
[jira] [Commented] (SPARK-19409) Upgrade Parquet to 1.8.2
[ https://issues.apache.org/jira/browse/SPARK-19409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852090#comment-15852090 ] Apache Spark commented on SPARK-19409: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/16795 > Upgrade Parquet to 1.8.2 > > > Key: SPARK-19409 > URL: https://issues.apache.org/jira/browse/SPARK-19409 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.2.0 > > > Apache Parquet 1.8.2 was officially released last week, on 26 Jan. > This issue aims to bump the Parquet version to 1.8.2 since it includes many fixes. > https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142@%3Cdev.parquet.apache.org%3E -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios
[ https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852085#comment-15852085 ] Kay Ousterhout commented on SPARK-19326: I see, that makes sense; thanks for the additional explanation. [~andrewor14] did you think about this issue when implementing dynamic allocation originally? I noticed there's a [comment saying that speculation is not considered for simplicity](https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L579), but it does seem like this functionality can prevent speculation from occurring. > Speculated task attempts do not get launched in few scenarios > - > > Key: SPARK-19326 > URL: https://issues.apache.org/jira/browse/SPARK-19326 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.2, 2.1.0 >Reporter: Tejas Patil > > Speculated copies of tasks do not get launched in some cases. > Examples: > - All the running executors have no CPU slots left to accommodate a > speculated copy of the task(s). If all the running executors reside on a > set of slow / bad hosts, they will keep the job running for a long time > - `spark.task.cpus` > 1 and the running executor has not filled up all its > CPU slots, since the [speculated copies of tasks should run on a different > host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283] > and not on the host where the first copy was launched. > In both these cases, `ExecutorAllocationManager` does not know about pending > speculative task attempts and thinks that all the resource demands are well > taken care of. 
([relevant > code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265]) > This adds variation in job completion times and, more importantly, SLA > misses :( In prod, with a large number of jobs, I see this happening more > often than one would think. Chasing the bad hosts or the reason for slowness > doesn't scale. > Here is a tiny repro. Note that you need to launch this with (Mesos or YARN > or standalone deploy mode) along with `--conf spark.speculation=true --conf > spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100` > {code} > val n = 100 > val someRDD = sc.parallelize(1 to n, n) > someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { > if (index == 1) { > Thread.sleep(Long.MaxValue) // fake long running task(s) > } > it.toList.map(x => index + ", " + x).iterator > }).collect > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
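The gap can be seen in a back-of-the-envelope version of the allocation math. This is illustrative only; the real logic lives in ExecutorAllocationManager, and the function below is not Spark code:

```python
def max_executors_needed(pending, running, speculative, tasks_per_executor):
    """Ceiling of total outstanding task attempts over slots per executor.

    If speculative attempts are counted, the manager asks for extra capacity;
    if they are ignored (speculative=0), demand already looks fully satisfied
    and no new executors are requested, starving the speculative copies.
    """
    total = pending + running + speculative
    return -(-total // tasks_per_executor)  # ceiling division

# One slow task left, one speculative copy wanted, one task slot per executor.
print(max_executors_needed(0, 1, 0, 1))  # 1: demand looks satisfied
print(max_executors_needed(0, 1, 1, 1))  # 2: counting the speculative copy
```

With the first formula the manager never scales up, so when existing slots are full (or the only free slots are on the same host as the original attempt), the speculative copy has nowhere to run.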
[jira] [Assigned] (SPARK-19452) Fix bug in the name assignment method in SparkR
[ https://issues.apache.org/jira/browse/SPARK-19452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19452: Assignee: (was: Apache Spark) > Fix bug in the name assignment method in SparkR > --- > > Key: SPARK-19452 > URL: https://issues.apache.org/jira/browse/SPARK-19452 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0, 2.2.0 >Reporter: Wayne Zhang > > The names method fails to check the validity of the assignment values. This > can be fixed by calling colnames within names. See the example below. > {code} > df <- suppressWarnings(createDataFrame(iris)) > # this is an error > colnames(df) <- NULL > # this should report an error > names(df) <- NULL > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19452) Fix bug in the name assignment method in SparkR
[ https://issues.apache.org/jira/browse/SPARK-19452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19452: Assignee: Apache Spark > Fix bug in the name assignment method in SparkR > --- > > Key: SPARK-19452 > URL: https://issues.apache.org/jira/browse/SPARK-19452 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0, 2.2.0 >Reporter: Wayne Zhang >Assignee: Apache Spark > > The names method fails to check the validity of the assignment values. This > can be fixed by calling colnames within names. See the example below. > {code} > df <- suppressWarnings(createDataFrame(iris)) > # this is an error > colnames(df) <- NULL > # this should report an error > names(df) <- NULL > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19452) Fix bug in the name assignment method in SparkR
[ https://issues.apache.org/jira/browse/SPARK-19452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852069#comment-15852069 ] Apache Spark commented on SPARK-19452: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/16794 > Fix bug in the name assignment method in SparkR > --- > > Key: SPARK-19452 > URL: https://issues.apache.org/jira/browse/SPARK-19452 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0, 2.2.0 >Reporter: Wayne Zhang > > The names method fails to check the validity of the assignment values. This > can be fixed by calling colnames within names. See the example below. > {code} > df <- suppressWarnings(createDataFrame(iris)) > # this is an error > colnames(df) <- NULL > # this should report an error > names(df) <- NULL > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19386) Bisecting k-means in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-19386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reassigned SPARK-19386: Assignee: Krishna Kalyan (was: Miao Wang) > Bisecting k-means in SparkR documentation > - > > Key: SPARK-19386 > URL: https://issues.apache.org/jira/browse/SPARK-19386 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Krishna Kalyan > Fix For: 2.2.0 > > > we need updates to the programming guide, examples, and vignettes -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19386) Bisecting k-means in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-19386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-19386. -- Resolution: Fixed Fix Version/s: 2.2.0 > Bisecting k-means in SparkR documentation > - > > Key: SPARK-19386 > URL: https://issues.apache.org/jira/browse/SPARK-19386 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Krishna Kalyan > Fix For: 2.2.0 > > > we need updates to the programming guide, examples, and vignettes -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19454) Improve DataFrame.replace API
[ https://issues.apache.org/jira/browse/SPARK-19454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19454: Assignee: (was: Apache Spark) > Improve DataFrame.replace API > - > > Key: SPARK-19454 > URL: https://issues.apache.org/jira/browse/SPARK-19454 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current implementation suffers from the following issues: > - It is possible to use {{dict}} as {{to_replace}}, but we cannot skip the > {{value}} argument or pass {{None}} for it (although it is ignored). This requires > passing "magic" values: > {code} > df = sc.parallelize([("Alice", 1, 3.0)]).toDF() > df.replace({"Alice": "Bob"}, 1) > {code} > - The code doesn't check whether the provided types are correct. This can lead to > an exception in Py4j (harder to diagnose): > {code} > df.replace({"Alice": 1}, 1) > {code} > or silent failures (with the bundled Py4j version): > {code} > df.replace({1: 2, 3.0: 4.1, "a": "b"}, 1) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19454) Improve DataFrame.replace API
[ https://issues.apache.org/jira/browse/SPARK-19454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19454: Assignee: Apache Spark > Improve DataFrame.replace API > - > > Key: SPARK-19454 > URL: https://issues.apache.org/jira/browse/SPARK-19454 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark > > The current implementation suffers from the following issues: > - It is possible to use {{dict}} as {{to_replace}}, but we cannot skip the > {{value}} argument or pass {{None}} for it (although it is ignored). This requires > passing "magic" values: > {code} > df = sc.parallelize([("Alice", 1, 3.0)]).toDF() > df.replace({"Alice": "Bob"}, 1) > {code} > - The code doesn't check whether the provided types are correct. This can lead to > an exception in Py4j (harder to diagnose): > {code} > df.replace({"Alice": 1}, 1) > {code} > or silent failures (with the bundled Py4j version): > {code} > df.replace({1: 2, 3.0: 4.1, "a": "b"}, 1) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19454) Improve DataFrame.replace API
[ https://issues.apache.org/jira/browse/SPARK-19454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852046#comment-15852046 ] Apache Spark commented on SPARK-19454: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/16793 > Improve DataFrame.replace API > - > > Key: SPARK-19454 > URL: https://issues.apache.org/jira/browse/SPARK-19454 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current implementation suffers from the following issues: > - It is possible to use {{dict}} as {{to_replace}}, but we cannot skip the > {{value}} argument or pass {{None}} for it (although it is ignored). This requires > passing "magic" values: > {code} > df = sc.parallelize([("Alice", 1, 3.0)]).toDF() > df.replace({"Alice": "Bob"}, 1) > {code} > - The code doesn't check whether the provided types are correct. This can lead to > an exception in Py4j (harder to diagnose): > {code} > df.replace({"Alice": 1}, 1) > {code} > or silent failures (with the bundled Py4j version): > {code} > df.replace({1: 2, 3.0: 4.1, "a": "b"}, 1) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19454) Improve DataFrame.replace API
Maciej Szymkiewicz created SPARK-19454: -- Summary: Improve DataFrame.replace API Key: SPARK-19454 URL: https://issues.apache.org/jira/browse/SPARK-19454 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 2.1.0, 2.0.0, 1.6.0, 1.5.0, 2.2.0 Reporter: Maciej Szymkiewicz The current implementation suffers from the following issues: - It is possible to use {{dict}} as {{to_replace}}, but we cannot skip the {{value}} argument or pass {{None}} for it (although it is ignored). This requires passing "magic" values: {code} df = sc.parallelize([("Alice", 1, 3.0)]).toDF() df.replace({"Alice": "Bob"}, 1) {code} - The code doesn't check whether the provided types are correct. This can lead to an exception in Py4j (harder to diagnose): {code} df.replace({"Alice": 1}, 1) {code} or silent failures (with the bundled Py4j version): {code} df.replace({1: 2, 3.0: 4.1, "a": "b"}, 1) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
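A defensive version of the argument validation the ticket asks for might look like the sketch below. This is plain Python, not the actual PySpark patch; the helper name and the exact rules (e.g. rejecting all mixed-type dicts, while real Spark permits some numeric widening) are made up for illustration:

```python
def validate_to_replace(to_replace):
    """Reject the mixed-type dicts that currently fail deep inside Py4j."""
    if isinstance(to_replace, dict):
        key_types = {type(k) for k in to_replace}
        val_types = {type(v) for v in to_replace.values()}
        # All keys must share one type, and all values must share one type,
        # so the replacement can be dispatched to a single typed Java map.
        if len(key_types) > 1 or len(val_types) > 1:
            raise ValueError("to_replace keys/values must each share one type")
        for k, v in to_replace.items():
            # Replacing a value of one type with a value of another type
            # is the case that currently surfaces as an opaque Py4JError.
            if type(k) is not type(v):
                raise ValueError("cannot replace %s with %s"
                                 % (type(k).__name__, type(v).__name__))
    return True

print(validate_to_replace({"Alice": "Bob"}))  # True
try:
    validate_to_replace({1: 2, 3.0: 4.1, "a": "b"})  # fails silently today
except ValueError as e:
    print("rejected:", e)
```

Failing fast in Python with a clear message is the point: both pathological calls in the description would then raise before any Py4j round trip.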
[jira] [Assigned] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19453: Assignee: Apache Spark > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19453: Assignee: (was: Apache Spark) > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19453: Assignee: Apache Spark > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852017#comment-15852017 ] Apache Spark commented on SPARK-19453: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/16792 > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19453: Assignee: (was: Apache Spark) > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-19453: --- Summary: Correct DataFrame.replace docs (was: Correct Column.replace docs) > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19453) Correct Column.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-19453: --- Summary: Correct Column.replace docs (was: Correct ) > Correct Column.replace docs > --- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > Current docstring provides incorrect description of {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from `na.fill` docs. In fact {{dict}} should > provide mapping from value to replacement value. > Moreover docs fail to explain some fundamental limitations (like lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19453) Correct
Maciej Szymkiewicz created SPARK-19453: -- Summary: Correct Key: SPARK-19453 URL: https://issues.apache.org/jira/browse/SPARK-19453 Project: Spark Issue Type: Improvement Components: Documentation, PySpark, SQL Affects Versions: 2.1.0, 2.0.0, 1.6.0, 1.5.0, 2.2.0 Reporter: Maciej Szymkiewicz Current docstring provides incorrect description of {{to_replace}} argument: {quote} If the value is a dict, then `value` is ignored and `to_replace` must be a mapping from column name (string) to replacement value. {quote} It looks like it has been copied from `na.fill` docs. In fact {{dict}} should provide mapping from value to replacement value. Moreover docs fail to explain some fundamental limitations (like lack of support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
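The corrected semantics described above can be sketched in plain Python (an illustration of the intended behavior, not PySpark's implementation). Note the mapping is kept homogeneous, since the ticket points out that heterogeneous values are a fundamental limitation:

```python
# Plain-Python sketch of the *corrected* to_replace semantics:
# a dict maps existing value -> replacement value (not column -> value),
# and the `value` argument is ignored in that case.
rows = [{"name": "Alice", "age": 10}, {"name": "Bob", "age": 20}]
to_replace = {10: 11, 20: 21}  # value -> replacement, same (numeric) type

def replace_values(rows, mapping):
    """Swap every matching cell value according to `mapping`."""
    return [{col: mapping.get(v, v) for col, v in row.items()} for row in rows]

print(replace_values(rows, to_replace))
```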
[jira] [Created] (SPARK-19452) Fix bug in the name assignment method in SparkR
Wayne Zhang created SPARK-19452: --- Summary: Fix bug in the name assignment method in SparkR Key: SPARK-19452 URL: https://issues.apache.org/jira/browse/SPARK-19452 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0, 2.2.0 Reporter: Wayne Zhang The names method fails to check for validity of the assignment values. This can be fixed by calling colnames within names. See example below. {code} df <- suppressWarnings(createDataFrame(iris)) # this is error colnames(df) <- NULL # this should report error names(df) <- NULL {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
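The proposed fix above — have the permissive `names` setter delegate to the validating `colnames` setter — can be sketched in Python (a hypothetical analogue for illustration, not the SparkR code):

```python
# Hypothetical analogue of the SparkR fix: `set_names` delegates to
# `set_colnames`, so both setters share the same validity checks.
class Frame:
    def __init__(self, cols):
        self._cols = list(cols)

    def set_colnames(self, value):
        # The validating setter: rejects None and wrong-length assignments.
        if value is None or len(value) != len(self._cols):
            raise ValueError("invalid column names")
        self._cols = list(value)

    def set_names(self, value):
        # The fix: delegate instead of assigning unchecked.
        self.set_colnames(value)

f = Frame(["a", "b"])
f.set_names(["x", "y"])   # valid rename succeeds
print(f._cols)
```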
[jira] [Resolved] (SPARK-18539) Cannot filter by nonexisting column in parquet file
[ https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-18539. Resolution: Fixed Assignee: Dongjoon Hyun Target Version/s: 2.2.0 > Cannot filter by nonexisting column in parquet file > --- > > Key: SPARK-18539 > URL: https://issues.apache.org/jira/browse/SPARK-18539 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.0.2 >Reporter: Vitaly Gerasimov >Assignee: Dongjoon Hyun >Priority: Critical > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.DataTypes._ > import org.apache.spark.sql.types.{StructField, StructType} > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}""")) > sc.read > .schema(StructType(Seq(StructField("a", IntegerType > .json(jsonRDD) > .write > .parquet("/tmp/test") > sc.read > .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", > IntegerType, nullable = true > .load("/tmp/test") > .createOrReplaceTempView("table") > sc.sql("select b from table where b is not null").show() > {code} > returns: > {code} > 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalArgumentException: Column [b] was not found in schema! 
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59) > at > org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) > at > org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at
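Before the Parquet upgrade referenced below this report, one possible workaround (an untested sketch, assuming a live `spark` session) is to stop pushing filters down to parquet-mr, which is the layer rejecting columns absent from the file:

```python
# Sketch of a workaround: evaluate the predicate in Spark instead of
# pushing it down to parquet-mr. `spark.sql.parquet.filterPushdown` is a
# standard Spark SQL configuration key; default is true.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.sql("select b from table where b is not null").show()
```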
[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file
[ https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851965#comment-15851965 ] Cheng Lian commented on SPARK-18539: SPARK-19409 upgrades parquet-mr to 1.8.2 and fixed this issue. > Cannot filter by nonexisting column in parquet file > --- > > Key: SPARK-18539 > URL: https://issues.apache.org/jira/browse/SPARK-18539 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.0.2 >Reporter: Vitaly Gerasimov >Priority: Critical > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.DataTypes._ > import org.apache.spark.sql.types.{StructField, StructType} > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}""")) > sc.read > .schema(StructType(Seq(StructField("a", IntegerType > .json(jsonRDD) > .write > .parquet("/tmp/test") > sc.read > .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", > IntegerType, nullable = true > .load("/tmp/test") > .createOrReplaceTempView("table") > sc.sql("select b from table where b is not null").show() > {code} > returns: > {code} > 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalArgumentException: Column [b] was not found in schema! 
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59) > at > org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) > at > org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at
[jira] [Commented] (SPARK-19409) Upgrade Parquet to 1.8.2
[ https://issues.apache.org/jira/browse/SPARK-19409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851959#comment-15851959 ] Apache Spark commented on SPARK-19409: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/16791 > Upgrade Parquet to 1.8.2 > > > Key: SPARK-19409 > URL: https://issues.apache.org/jira/browse/SPARK-19409 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.2.0 > > > Apache Parquet 1.8.2 was officially released last week, on 26 Jan. > This issue aims to bump the Parquet version to 1.8.2 since it includes many fixes. > https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142@%3Cdev.parquet.apache.org%3E -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
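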
[jira] [Commented] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken
[ https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851960#comment-15851960 ] Apache Spark commented on SPARK-17213: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/16791 > Parquet String Pushdown for Non-Eq Comparisons Broken > - > > Key: SPARK-17213 > URL: https://issues.apache.org/jira/browse/SPARK-17213 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Andrew Duffy >Assignee: Cheng Lian > Fix For: 2.1.0 > > > Spark defines ordering over strings based on comparison of UTF8 byte arrays, > which compare bytes as unsigned integers. Currently however Parquet does not > respect this ordering. This is currently in the process of being fixed in > Parquet, JIRA and PR link below, but currently all filters are broken over > strings, with there actually being a correctness issue for {{>}} and {{<}}. > *Repro:* > Querying directly from in-memory DataFrame: > {code} > > Seq("a", "é").toDF("name").where("name > 'a'").count > 1 > {code} > Querying from a parquet dataset: > {code} > > Seq("a", "é").toDF("name").write.parquet("/tmp/bad") > > spark.read.parquet("/tmp/bad").where("name > 'a'").count > 0 > {code} > This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's > implementation of comparison of strings is based on signed byte array > comparison, so it will actually create 1 row group with statistics > {{min=é,max=a}}, and so the row group will be dropped by the query. > Based on the way Parquet pushes down Eq, it will not be affecting correctness > but it will force you to read row groups you should be able to skip. 
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686 > Link to PR: https://github.com/apache/parquet-mr/pull/362 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
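The root cause described above — Spark orders strings by *unsigned* UTF-8 bytes while old parquet-mr compared them as *signed* bytes — can be reproduced in plain Python (an illustration, not Parquet's code):

```python
# 'a' encodes to 0x61; 'é' encodes to 0xC3 0xA9. Compared unsigned,
# 0xC3 (195) > 0x61 (97); reinterpreted signed, 0xC3 becomes -61 < 97.
a = list("a".encode("utf-8"))   # unsigned byte values: [97]
e = list("é".encode("utf-8"))   # unsigned byte values: [195, 169]

def as_signed(bs):
    """Reinterpret each byte as a signed 8-bit integer."""
    return [b - 256 if b > 127 else b for b in bs]

print(a < e)                        # Spark's (unsigned) order: 'a' < 'é'
print(as_signed(a) < as_signed(e))  # the signed order flips it: 'é' < 'a'
```

Under the signed order 'é' sorts *below* 'a', which is exactly why the row-group statistics come out as {{min=é,max=a}} and the {{> 'a'}} filter wrongly drops the group.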
[jira] [Updated] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Champ updated SPARK-19451: - Description: Hi there, there seems to be a major limitation in Spark window functions and the rangeBetween method. If I have the following code : {code:title=Example|borderStyle=solid} val tw = Window.orderBy("date") .partitionBy("id") .rangeBetween( from , 0) {code} Everything seems ok as long as the *from* value is not too large, even though the rangeBetween() method supports Long parameters. But if I set *from* to *-216000L*, it does not work! It is probably related to this part of the between() method of the WindowSpec class, called by rangeBetween(): {code:title=between() method|borderStyle=solid} val boundaryStart = start match { case 0 => CurrentRow case Long.MinValue => UnboundedPreceding case x if x < 0 => ValuePreceding(-start.toInt) case x if x > 0 => ValueFollowing(start.toInt) } {code} ( look at this *.toInt* ) Does anybody know if there's a way to solve / patch this behavior? Any help will be appreciated Thx was: Hi there, there seems to be a major limitation in spark window functions and rangeBetween method. If I have the following code : ``` val tw = Window.orderBy("date") .partitionBy("id") .rangeBetween( from , 0) ``` Everything seems ok, while "from" value is not too large... Even if the rangeBetween() method supports Long parameters. But If i set "-216000L" value to "from" it does not work ! It is probably related to this part of code in the between() method, of the WindowSpec class, called by rangeBetween() ``` val boundaryStart = start match { case 0 => CurrentRow case Long.MinValue => UnboundedPreceding case x if x < 0 => ValuePreceding(-start.toInt) case x if x > 0 => ValueFollowing(start.toInt) } ``` ( look at this " .toInt " ) Does anybody know it there's a way to solve / patch this behavior ?
Any help will be appreciated Thx > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in Spark window functions and the > rangeBetween method. > If I have the following code : > {code:title=Example|borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok as long as the *from* value is not too large, even though the > rangeBetween() method supports Long parameters. > But if I set *from* to *-216000L*, it does not work! > It is probably related to this part of the between() method of the > WindowSpec class, called by rangeBetween(): > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know if there's a way to solve / patch this behavior? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
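The `.toInt` narrowing the reporter highlights can be mimicked with `ctypes.c_int32`, which keeps only the low 32 bits exactly as Scala's `Long.toInt` does (an illustration, not Spark code): any window bound beyond the Int range is silently corrupted.

```python
import ctypes

def scala_to_int(long_value):
    # Emulates Scala's Long.toInt: keep the low 32 bits, reinterpret signed.
    return ctypes.c_int32(long_value).value

print(scala_to_int(-10_000_000_000))  # out of Int range: silently corrupted
```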
[jira] [Commented] (SPARK-19233) Inconsistent Behaviour of Spark Streaming Checkpoint
[ https://issues.apache.org/jira/browse/SPARK-19233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851787#comment-15851787 ] Nan Zhu commented on SPARK-19233: - ping > Inconsistent Behaviour of Spark Streaming Checkpoint > > > Key: SPARK-19233 > URL: https://issues.apache.org/jira/browse/SPARK-19233 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nan Zhu > > When checking one of our application logs, we found the following behavior > (simplified) > 1. Spark application recovers from checkpoint constructed at timestamp 1000ms > 2. The log shows that Spark application can recover RDDs generated at > timestamp 2000, 3000 > The root cause is that generateJobs event is pushed to the queue by a > separate thread (RecurTimer), before doCheckpoint event is pushed to the > queue, there might have been multiple generatedJobs being processed. As a > result, when doCheckpoint for timestamp 1000 is processed, the generatedRDDs > data structure containing RDDs generated at 2000, 3000 is serialized as part > of checkpoint of 1000. > It brings overhead for debugging and coordinate our offset management > strategy with Spark Streaming's checkpoint strategy when we are developing a > new type of DStream which integrates Spark Streaming with an internal message > middleware. > The proposed fix is to filter generatedRDDs according to checkpoint timestamp > when serializing it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
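The fix proposed at the end of this report — filter `generatedRDDs` by the checkpoint's own timestamp before serializing — can be reduced to a tiny sketch (plain-Python illustration with made-up state, not the DStream code):

```python
# State accumulated by the time the DoCheckpoint(1000) event is processed:
# generateJobs events for later batches have already run on another thread.
generated_rdds = {1000: "rdd@1000", 2000: "rdd@2000", 3000: "rdd@3000"}
checkpoint_time = 1000

# Proposed fix: serialize only entries at or before the checkpoint's own
# timestamp, so recovery never sees RDDs "from the future".
to_serialize = {t: r for t, r in generated_rdds.items() if t <= checkpoint_time}
print(to_serialize)
```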
[jira] [Assigned] (SPARK-19450) Replace askWithRetry with askSync.
[ https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19450: Assignee: (was: Apache Spark) > Replace askWithRetry with askSync. > -- > > Key: SPARK-19450 > URL: https://issues.apache.org/jira/browse/SPARK-19450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing > > *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and > https://github.com/apache/spark/pull/16690#issuecomment-276850068) and > *askWithRetry* is marked as deprecated. > As mentioned > SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): > ??askWithRetry is basically an unneeded API, and a leftover from the akka > days that doesn't make sense anymore. It's prone to cause deadlocks (exactly > because it's blocking), it imposes restrictions on the caller (e.g. > idempotency) and other things that people generally don't pay that much > attention to when using it.?? > Since *askWithRetry* is just used inside spark and not in user logic. It > might make sense to replace all of them with *askSync*. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19450) Replace askWithRetry with askSync.
[ https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19450: Assignee: Apache Spark > Replace askWithRetry with askSync. > -- > > Key: SPARK-19450 > URL: https://issues.apache.org/jira/browse/SPARK-19450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: Apache Spark > > *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and > https://github.com/apache/spark/pull/16690#issuecomment-276850068) and > *askWithRetry* is marked as deprecated. > As mentioned > SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): > ??askWithRetry is basically an unneeded API, and a leftover from the akka > days that doesn't make sense anymore. It's prone to cause deadlocks (exactly > because it's blocking), it imposes restrictions on the caller (e.g. > idempotency) and other things that people generally don't pay that much > attention to when using it.?? > Since *askWithRetry* is just used inside spark and not in user logic. It > might make sense to replace all of them with *askSync*. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19280) Failed Recovery from checkpoint caused by the multi-threads issue in Spark Streaming scheduler
[ https://issues.apache.org/jira/browse/SPARK-19280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851786#comment-15851786 ] Nan Zhu commented on SPARK-19280: - ping > Failed Recovery from checkpoint caused by the multi-threads issue in Spark > Streaming scheduler > -- > > Key: SPARK-19280 > URL: https://issues.apache.org/jira/browse/SPARK-19280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nan Zhu >Priority: Critical > > In one of our applications, we found the following issue, the application > recovering from a checkpoint file named "checkpoint-***16670" but with > the timestamp ***16650 will recover from the very beginning of the stream > and because our application relies on the external & periodically-cleaned > data (syncing with checkpoint cleanup), the recovery just failed > We identified a potential issue in Spark Streaming checkpoint and will > describe it with the following example. We will propose a fix in the end of > this JIRA. > 1. The application properties: Batch Duration: 2, Functionality: Single > Stream calling ReduceByKeyAndWindow and print, Window Size: 6, > SlideDuration, 2 > 2. RDD at 16650 is generated and the corresponding job is submitted to > the execution ThreadPool. Meanwhile, a DoCheckpoint message is sent to the > queue of JobGenerator > 3. Job at 16650 is finished and JobCompleted message is sent to > JobScheduler's queue, and meanwhile, Job at 16652 is submitted to the > execution ThreadPool and similarly, a DoCheckpoint is sent to the queue of > JobGenerator > 4. JobScheduler's message processing thread (I will use JS-EventLoop to > identify it) is not scheduled by the operating system for a long time, and > during this period, Jobs generated from 16652 - 16670 are generated > and completed. > 5. 
the message processing thread of JobGenerator (JG-EventLoop) is scheduled > and processed all DoCheckpoint messages for jobs ranging from 16652 - > 16670 and checkpoint files are successfully written. CRITICAL: at this > moment, the lastCheckpointTime would be 16670. > 6. JS-EventLoop is scheduled, and process all JobCompleted messages for jobs > ranging from 16652 - 16670. CRITICAL: a ClearMetadata message is > pushed to JobGenerator's message queue for EACH JobCompleted. > 7. The current message queue contains 20 ClearMetadata messages and > JG-EventLoop is scheduled to process them. CRITICAL: ClearMetadata will > remove all RDDs out of rememberDuration window. In our case, > ReduceyKeyAndWindow will set rememberDuration to 10 (rememberDuration of > ReducedWindowDStream (4) + windowSize) resulting that only RDDs <- > (16660, 16670] are kept. And ClearMetaData processing logic will push > a DoCheckpoint to JobGenerator's thread > 8. JG-EventLoop is scheduled again to process DoCheckpoint for 1665, VERY > CRITICAL: at this step, RDD no later than 16660 has been removed, and > checkpoint data is updated as > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L53 > and > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L59. > 9. After 8, a checkpoint named /path/checkpoint-16670 is created but with > the timestamp 16650. and at this moment, Application crashed > 10. Application recovers from /path/checkpoint-16670 and try to get RDD > with validTime 16650. Of course it will not find it and has to recompute. > In the case of ReduceByKeyAndWindow, it needs to recursively slice RDDs until > to the start of the stream. When the stream depends on the external data, it > will not successfully recover. 
In the case of Kafka, the recovered RDDs would > not be the same as the original one, as the currentOffsets has been updated > to the value at the moment of 16670 > The proposed fix: > 0. a hot-fix would be setting timestamp Checkpoint File to lastCheckpointTime > instead of using the timestamp of Checkpoint instance (any side-effect?) > 1. ClearMetadata shall be ClearMedataAndCheckpoint > 2. a long-term fix would be merge JobScheduler and JobGenerator, I didn't see > any necessary to have two threads here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
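The mismatch in steps 9–10 can be condensed into a small sketch (illustrative numbers taken from the description, not Spark code): the file name carries time 16670, the serialized timestamp says 16650, and ClearMetadata has already dropped every RDD at or before 16660.

```python
file_name_time = 16670         # /path/checkpoint-***16670 on disk
serialized_time = 16650        # timestamp stored inside the Checkpoint
remembered_rdds = {16662, 16664, 16666, 16668, 16670}  # kept after cleanup

# Recovery looks up the RDD for the serialized timestamp and misses,
# forcing recursive recomputation back to the start of the stream.
print(serialized_time in remembered_rdds)
```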
[jira] [Commented] (SPARK-19450) Replace askWithRetry with askSync.
[ https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851785#comment-15851785 ] Apache Spark commented on SPARK-19450: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/16790 > Replace askWithRetry with askSync. > -- > > Key: SPARK-19450 > URL: https://issues.apache.org/jira/browse/SPARK-19450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing > > *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and > https://github.com/apache/spark/pull/16690#issuecomment-276850068) and > *askWithRetry* is marked as deprecated. > As mentioned > SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): > ??askWithRetry is basically an unneeded API, and a leftover from the akka > days that doesn't make sense anymore. It's prone to cause deadlocks (exactly > because it's blocking), it imposes restrictions on the caller (e.g. > idempotency) and other things that people generally don't pay that much > attention to when using it.?? > Since *askWithRetry* is just used inside spark and not in user logic. It > might make sense to replace all of them with *askSync*. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19428) Ability to select first row of groupby
[ https://issues.apache.org/jira/browse/SPARK-19428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851784#comment-15851784 ] Luke Miner edited comment on SPARK-19428 at 2/3/17 5:33 PM: Couple of things. Sometimes I just want a random row from each group. However, if sorting is allowed, then it is useful to get a single row from each group based on the maximum or minimum of some column. Even more useful would be something like {{df.groupby('group').orderBy('foo').limit(10)}} Then it would be easy to get the top 10 observations for each group based on some criteria. was (Author: lminer): Couple of things. Sometimes I just want a random row from each group. However, if sorting is allowed, then it is useful to get a single row from each group based on the maximum or minimum of some column. Even more useful would be something like {{df.groupby('group').orderBy('foo').limit(n)}} Then it would be easy to get the top {{n}} observations for each group based on some criteria. > Ability to select first row of groupby > -- > > Key: SPARK-19428 > URL: https://issues.apache.org/jira/browse/SPARK-19428 > Project: Spark > Issue Type: Brainstorming > Components: SQL >Affects Versions: 2.1.0 >Reporter: Luke Miner >Priority: Minor > > It would be nice to be able to select the first row from {{GroupedData}}. > Pandas has something like this: > {{df.groupby('group').first()}} > It's especially handy if you can order the group as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19428) Ability to select first row of groupby
[ https://issues.apache.org/jira/browse/SPARK-19428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851784#comment-15851784 ] Luke Miner commented on SPARK-19428: Couple of things. Sometimes I just want a random row from each group. However, if sorting is allowed, then it is useful to get a single row from each group based on the maximum or minimum of some column. Even more useful would be something like {{df.groupby('group').orderBy('foo').limit(n)}} Then it would be easy to get the top {{n}} observations for each group based on some criteria. > Ability to select first row of groupby > -- > > Key: SPARK-19428 > URL: https://issues.apache.org/jira/browse/SPARK-19428 > Project: Spark > Issue Type: Brainstorming > Components: SQL >Affects Versions: 2.1.0 >Reporter: Luke Miner >Priority: Minor > > It would be nice to be able to select the first row from {{GroupedData}}. > Pandas has something like this: > {{df.groupby('group').first()}} > It's especially handy if you can order the group as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
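The requested semantics — first row per group under an ordering — can be sketched in plain Python (the `df.groupby('group').first()` call above is the Pandas API; this sketch is illustration only). In Spark itself this is commonly approximated with a `row_number()` window over the group partition, filtered to row number 1.

```python
from itertools import groupby

rows = [("a", 3), ("b", 1), ("a", 1), ("b", 2)]   # (group, foo) pairs

def first_per_group(rows):
    # Order by (group, foo), then keep the first row of each group run.
    ordered = sorted(rows)
    return [next(g) for _, g in groupby(ordered, key=lambda r: r[0])]

print(first_per_group(rows))
```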
[jira] [Created] (SPARK-19451) Long values in Window function
Julien Champ created SPARK-19451: Summary: Long values in Window function Key: SPARK-19451 URL: https://issues.apache.org/jira/browse/SPARK-19451 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2, 1.6.1 Reporter: Julien Champ Hi there, there seems to be a major limitation in Spark window functions and the rangeBetween() method. If I have the following code: ``` val tw = Window.orderBy("date") .partitionBy("id") .rangeBetween( from , 0) ``` everything seems OK while the "from" value is not too large, even though the rangeBetween() method accepts Long parameters. But if I set "from" to -216000L it does not work! It is probably related to this part of the between() method of the WindowSpec class, called by rangeBetween(): ``` val boundaryStart = start match { case 0 => CurrentRow case Long.MinValue => UnboundedPreceding case x if x < 0 => ValuePreceding(-start.toInt) case x if x > 0 => ValueFollowing(start.toInt) } ``` (note the ".toInt") Does anybody know if there's a way to solve / patch this behavior? Any help will be appreciated. Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
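The ".toInt" in between() narrows a Long boundary to 32 bits. Note that -216000 itself fits comfortably in an Int, so the narrowing alone may not fully explain the reported failure, but any range expressed in milliseconds over a long period silently wraps. A sketch emulating the JVM's Long-to-Int narrowing (the to_int32 helper is illustrative, not Spark code):

```python
def to_int32(n: int) -> int:
    """Emulate JVM/Scala `Long.toInt`: keep the low 32 bits, reinterpret as signed."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

print(to_int32(-216000))                 # fits in an Int: survives unchanged
print(to_int32(30 * 24 * 3600 * 1000))   # 30 days in ms: wraps to -1702967296
```

So a rangeBetween boundary of "30 days in milliseconds" would come out not merely wrong but positive after narrowing, which is consistent with large Long boundaries misbehaving.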
[jira] [Updated] (SPARK-19450) Replace askWithRetry with askSync.
[ https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jin xing updated SPARK-19450: - Description: *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and https://github.com/apache/spark/pull/16690#issuecomment-276850068) and *askWithRetry* is marked as deprecated. As mentioned in SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): ??askWithRetry is basically an unneeded API, and a leftover from the akka days that doesn't make sense anymore. It's prone to cause deadlocks (exactly because it's blocking), it imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay that much attention to when using it.?? Since *askWithRetry* is only used inside Spark and not in user logic, it might make sense to replace all of them with *askSync*. was: *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and https://github.com/apache/spark/pull/16690#issuecomment-276850068) and *askWithRetry* is marked as deprecated. As mentioned SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): ??askWithRetry is that it's basically an unneeded API, and a leftover from the akka days that doesn't make sense anymore. It's prone to cause deadlocks (exactly because it's blocking), it imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay that much attention to when using it.?? Since *askWithRetry* is just used inside spark and not in user logic. It might make sense to replace all of them with *askSync*. > Replace askWithRetry with askSync. 
> -- > > Key: SPARK-19450 > URL: https://issues.apache.org/jira/browse/SPARK-19450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing > > *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and > https://github.com/apache/spark/pull/16690#issuecomment-276850068) and > *askWithRetry* is marked as deprecated. > As mentioned in > SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): > ??askWithRetry is basically an unneeded API, and a leftover from the akka > days that doesn't make sense anymore. It's prone to cause deadlocks (exactly > because it's blocking), it imposes restrictions on the caller (e.g. > idempotency) and other things that people generally don't pay that much > attention to when using it.?? > Since *askWithRetry* is only used inside Spark and not in user logic, it > might make sense to replace all of them with *askSync*. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19450) Replace askWithRetry with askSync.
jin xing created SPARK-19450: Summary: Replace askWithRetry with askSync. Key: SPARK-19450 URL: https://issues.apache.org/jira/browse/SPARK-19450 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: jin xing *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and https://github.com/apache/spark/pull/16690#issuecomment-276850068) and *askWithRetry* is marked as deprecated. As mentioned in SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): ??askWithRetry is basically an unneeded API, and a leftover from the akka days that doesn't make sense anymore. It's prone to cause deadlocks (exactly because it's blocking), it imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay that much attention to when using it.?? Since *askWithRetry* is only used inside Spark and not in user logic, it might make sense to replace all of them with *askSync*. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
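The idempotency caveat quoted above can be made concrete: with a blocking ask-plus-retry, losing the *reply* (after the handler already ran) makes the caller resend, so any non-idempotent side effect executes twice. A toy sketch (hypothetical names, not Spark's actual RPC code):

```python
class Endpoint:
    """A toy RPC endpoint whose handler has a side effect."""
    def __init__(self):
        self.applied = 0

    def handle(self):
        self.applied += 1  # side effect: runs once per *delivered* request
        return "ack"

def ask_with_retry(endpoint, reply_lost_once=True):
    # First attempt: the handler runs, but the reply is "lost" in transit,
    # so the caller times out and transparently resends the same request.
    endpoint.handle()
    if reply_lost_once:
        return endpoint.handle()  # retry re-applies the side effect
    return "ack"

ep = Endpoint()
ask_with_retry(ep)
print(ep.applied)  # 2: one logical request, applied twice -> handler must be idempotent
```

A single-shot ask (as askSync behaves) would instead surface the timeout to the caller, leaving the decision to retry, and the idempotency burden, explicit at the call site.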
[jira] [Commented] (SPARK-19448) unify some duplication function in MetaStoreRelation
[ https://issues.apache.org/jira/browse/SPARK-19448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851631#comment-15851631 ] Apache Spark commented on SPARK-19448: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/16787 > unify some duplication function in MetaStoreRelation > > > Key: SPARK-19448 > URL: https://issues.apache.org/jira/browse/SPARK-19448 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > 1. MetaStoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's > toHiveTable > 2. MetaStoreRelation's toHiveColumn can be replaced by calling > HiveClientImpl's toHiveColumn > 3. process another TODO > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19448) unify some duplication function in MetaStoreRelation
[ https://issues.apache.org/jira/browse/SPARK-19448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19448: Assignee: (was: Apache Spark) > unify some duplication function in MetaStoreRelation > > > Key: SPARK-19448 > URL: https://issues.apache.org/jira/browse/SPARK-19448 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > 1. MetaStoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's > toHiveTable > 2. MetaStoreRelation's toHiveColumn can be replaced by calling > HiveClientImpl's toHiveColumn > 3. process another TODO > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19448) unify some duplication function in MetaStoreRelation
[ https://issues.apache.org/jira/browse/SPARK-19448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19448: Assignee: Apache Spark > unify some duplication function in MetaStoreRelation > > > Key: SPARK-19448 > URL: https://issues.apache.org/jira/browse/SPARK-19448 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Assignee: Apache Spark >Priority: Minor > > 1. MetaStoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's > toHiveTable > 2. MetaStoreRelation's toHiveColumn can be replaced by calling > HiveClientImpl's toHiveColumn > 3. process another TODO > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
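The items in the SPARK-19448 description all amount to one refactor: drop MetastoreRelation's private copies of the conversion logic and delegate to the single implementation in HiveClientImpl. The shape of that change, sketched in Python with made-up fields (the real conversions live in HiveClientImpl's toHiveTable / toHiveColumn, in Scala):

```python
class HiveClientImpl:
    """Stand-in for the object that owns the one canonical conversion."""
    @staticmethod
    def to_hive_table(catalog_table: dict) -> dict:
        # the single place where conversion rules are maintained
        return {"name": catalog_table["name"], "cols": list(catalog_table["cols"])}

class MetastoreRelation:
    def __init__(self, catalog_table: dict):
        self.catalog_table = catalog_table

    @property
    def hive_ql_table(self) -> dict:
        # after the refactor: delegate instead of keeping a duplicate converter
        return HiveClientImpl.to_hive_table(self.catalog_table)

rel = MetastoreRelation({"name": "t1", "cols": ["a", "b"]})
print(rel.hive_ql_table)  # identical to calling HiveClientImpl directly
```

The payoff is the usual one for de-duplication: future changes to the conversion rules happen in exactly one place.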
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851593#comment-15851593 ] Sean Owen commented on SPARK-19449: --- I don't think this can be made fully deterministic even when setting seeds in certain places. If the results are quite similar and sensible I don't think this is a problem > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. Can run this as a simple > Java app as long as you have spark dependencies set up properly. 
> {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new 
RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} > }, labels, features) + "\n" + > "\n" + > "** RandomForestClassifier\n" + > "** Original **" + getInfo(model2, testData) + "\n" + > "** New **" + getInfo(new Predictor() {double > predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, > features) + "\n" + > "\n" + > ""); > } > static Dataset getDummyData(SparkSession spark, int numberRows, int > numberFeatures, int labelUpperBound) { > StructType schema = new StructType(new StructField[]{ > new StructField("label", DataTypes.DoubleType, false, > Metadata.empty()), > new StructField("features", new VectorUDT(), false, > Metadata.empty()) >
[jira] [Assigned] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19444: Assignee: Apache Spark > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Assignee: Apache Spark >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851589#comment-15851589 ] Apache Spark commented on SPARK-19444: -- User 'anshbansal' has created a pull request for this issue: https://github.com/apache/spark/pull/16789 > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851588#comment-15851588 ] Aseem Bansal commented on SPARK-19444: -- https://github.com/apache/spark/pull/16789 > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19444: Assignee: (was: Apache Spark) > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851576#comment-15851576 ] Aseem Bansal commented on SPARK-19449: -- Isn't the decision tree debug string print it as a series of IF-ELSE? I printed the debug string for the 2 random forest models and it was exactly the same. In other words the 2 implementations should be mathematically equivalent. The random processes for selecting data should not cause any issues as I ensured that the exact same data is going to both versions. It works for decision trees and random forest classifier is just majority vote of bunch of decision trees classifiers so I cannot see how that could be different. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. Can run this as a simple > Java app as long as you have spark dependencies set up properly. 
> {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new 
RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} > }, labels, features) + "\n" + > "\n" + > "** RandomForestClassifier\n" + > "** Original **" + getInfo(model2, testData) + "\n" + > "** New **" + getInfo(new Predictor() {double > predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, > features) + "\n" + > "\n" + > ""); > } >
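The comment above reasons that a random forest classifier "is just majority vote" of its trees, so identical trees must give identical predictions. One candidate explanation for the discrepancy (an assumption about the cause, not something confirmed in this thread) is that hard majority voting and probability-averaging ("soft" voting) can disagree even over identical trees:

```python
# Two aggregation rules over the *same* trees can pick different classes.
# Each tree reports a class-probability distribution [P(class0), P(class1)].
trees = [
    [0.6, 0.4],   # tree 1 leans class 0
    [0.6, 0.4],   # tree 2 leans class 0
    [0.0, 1.0],   # tree 3 is certain of class 1
]

def hard_vote(trees):
    # each tree casts one vote for its argmax class; most votes wins
    votes = [max(range(len(t)), key=t.__getitem__) for t in trees]
    return max(set(votes), key=votes.count)

def soft_vote(trees):
    # sum the per-class probabilities across trees, then take the argmax
    totals = [sum(t[c] for t in trees) for c in range(len(trees[0]))]
    return max(range(len(totals)), key=totals.__getitem__)

print(hard_vote(trees))  # 0: two of three trees vote class 0
print(soft_vote(trees))  # 1: summed probabilities (1.2 vs 1.8) favor class 1
```

If the two implementations aggregate differently in this sense, equal debug strings for the underlying trees would not be enough to guarantee equal forest-level predictions.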
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851568#comment-15851568 ] Aseem Bansal commented on SPARK-19449: -- [~srowen] I removed some extra code. The part where I did the conversion is at the end in convertRandomForestModel method. Basically the above code does this - Prepare 1000 rows of data with 3 features randomly. Prepare 1000 labels randomly. I am not working on creating the model but the conversion. So having random data is not an issue. It will just be a horrible model. - Split the data in 80/20 ratio for training/test - train ml version of decision tree model and random forest model using the training set. Let's call them DT1 and RF1 - convert these to mllib version of the models. Let's call them DT2 and RF2 - Use the test set to predict labels using DT1, DT2, RF1, RF2. - Compare predicted labels DT1 with DT2. Same results - Compare predicted labels RF1 with RF2. Different results. There should not be any random results here as I have used seeds for random number generators everywhere and then used the exact same data for doing predictions using all 4 models. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. 
Can run this as a simple > Java app as long as you have spark dependencies set up properly. > {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > 
convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} >
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. So the behavior between the 2 implementations is inconsistent which should not be the case. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); 
Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + ""); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()) }); double[][] vectors = prepareData(numberRows, numberFeatures); Random random = new Random(seed); List dataTest = new ArrayList<>(); for (double[] vector : vectors) { double label = (double) random.nextInt(2); dataTest.add(RowFactory.create(label, Vectors.dense(vector))); } return spark.createDataFrame(dataTest, schema); } static double[][] prepareData(int numRows, int numFeatures) { Random random = new Random(seed); double[][] 
result = new double[numRows][numFeatures]; for (int row = 0; row < numRows; row++) { for (int feature = 0; feature < numFeatures; feature++) { result[row][feature] = random.nextDouble(); } } return result; } static
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); //Dataset data = getData(spark, "libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); 
Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + "** LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new 
StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()) }); double[][] vectors = prepareData(numberRows, numberFeatures);
[jira] [Updated] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase
[ https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nattavut Sutyanyong updated SPARK-18874: Attachment: SPARK-18874-3.pdf Design document version 1.1 dated February 3, 2017. > First phase: Deferring the correlated predicate pull up to Optimizer phase > -- > > Key: SPARK-18874 > URL: https://issues.apache.org/jira/browse/SPARK-18874 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Nattavut Sutyanyong > Attachments: SPARK-18874-3.pdf > > > This JIRA implements the first phase of SPARK-18455 by deferring the > correlated predicate pull up from Analyzer to Optimizer. The goal is to > preserve the current functionality of subqueries in Spark 2.0 (if a query works, it > continues to work after this JIRA; if it does not, it won't). The performance > of subquery processing is expected to be on par with Spark 2.0. > The representation of the LogicalPlan after the Analyzer will be different after > this JIRA, in that it will preserve the original positions of correlated > predicates in a subquery. This new representation is preparation work for > the second phase, extending the support of correlated subqueries to cases > Spark 2.0 does not support, such as deep correlation and outer references in the > SELECT clause. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase
[ https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851550#comment-15851550 ] Nattavut Sutyanyong commented on SPARK-18874: - I have published a design document as a reference for reviewing the code. https://issues.apache.org/jira/secure/attachment/12850832/SPARK-18874-3.pdf > First phase: Deferring the correlated predicate pull up to Optimizer phase > -- > > Key: SPARK-18874 > URL: https://issues.apache.org/jira/browse/SPARK-18874 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Nattavut Sutyanyong > Attachments: SPARK-18874-3.pdf > > > This JIRA implements the first phase of SPARK-18455 by deferring the > correlated predicate pull up from Analyzer to Optimizer. The goal is to > preserve the current functionality of subqueries in Spark 2.0 (if a query works, it > continues to work after this JIRA; if it does not, it won't). The performance > of subquery processing is expected to be on par with Spark 2.0. > The representation of the LogicalPlan after the Analyzer will be different after > this JIRA, in that it will preserve the original positions of correlated > predicates in a subquery. This new representation is preparation work for > the second phase, extending the support of correlated subqueries to cases > Spark 2.0 does not support, such as deep correlation and outer references in the > SELECT clause. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. So the behavior between the 2 implementations is inconsistent which should not be the case. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); 
//Dataset data = getData(spark, "libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + "** LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, 
int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false,
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851540#comment-15851540 ] Sean Owen commented on SPARK-19449: --- Can you boil this down? This is a lot of code to look at. I would not necessarily expect the exact same results, even though a lot of code is shared, because of randomness and differences in ancillary processes like the pipeline elements that select training data and perform evaluation. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert the ml package RandomForestClassificationModel > to the mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModels are exactly the same. > The code below reproduces the issue. You can run it as a simple > Java app as long as you have the Spark dependencies set up properly. 
> {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > //Dataset data = getData(spark, "libsvm", > "/opt/spark2/data/mllib/sample_libsvm_data.txt"); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > 
convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > LogisticRegression lr = new LogisticRegression(); > LogisticRegressionModel model3 = lr.fit(trainingData); > final org.apache.spark.mllib.classification.LogisticRegressionModel > convertedModel3 = convertLogisticRegressionModel(model3); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} > }, labels, features) + "\n" + > "\n" + > "** RandomForestClassifier\n" + > "** Original **" + getInfo(model2, testData) + "\n" + > "** New **" + getInfo(new Predictor() {double > predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, > features) + "\n" + > "\n" + >
[jira] [Assigned] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16742: Assignee: Apache Spark > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt >Assignee: Apache Spark > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16742: Assignee: (was: Apache Spark) > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851539#comment-15851539 ] Apache Spark commented on SPARK-16742: -- User 'arinconstrio' has created a pull request for this issue: https://github.com/apache/spark/pull/16788 > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
Aseem Bansal created SPARK-19449: Summary: Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel Key: SPARK-19449 URL: https://issues.apache.org/jira/browse/SPARK-19449 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.1.0 Reporter: Aseem Bansal I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. The below code can be used to reproduce the issue. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); //Dataset data = getData(spark, 
"libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + "** LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int 
labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
[jira] [Assigned] (SPARK-19244) Sort MemoryConsumers according to their memory usage when spilling
[ https://issues.apache.org/jira/browse/SPARK-19244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-19244: --- Assignee: Liang-Chi Hsieh > Sort MemoryConsumers according to their memory usage when spilling > -- > > Key: SPARK-19244 > URL: https://issues.apache.org/jira/browse/SPARK-19244 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.2.0 > > > In `TaskMemoryManager`, when we acquire memory by calling > `acquireExecutionMemory` and cannot get the required amount, we try to > spill other memory consumers. > Currently, we simply iterate over the memory consumers in a hash set, normally > in the same order each time. > The first issue is that we might spill more consumers than necessary. For example, if > consumer 1 uses 10MB, consumer 2 uses 50MB, and consumer 3 then tries to acquire 100MB > when only 60MB is free, spilling is needed. We might spill both consumer > 1 and consumer 2, but spilling consumer 2 alone is enough to satisfy the > required 100MB. > The second issue: suppose we spill consumer 1 the first time spilling occurs. After > a while, consumer 1 uses only 5MB. Then consumer 4 may acquire some memory and > spilling is needed again. Because we iterate over the memory consumers in the same > order, we will spill consumer 1 again, producing > many small spill files for consumer 1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19244) Sort MemoryConsumers according to their memory usage when spilling
[ https://issues.apache.org/jira/browse/SPARK-19244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-19244. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16603 [https://github.com/apache/spark/pull/16603] > Sort MemoryConsumers according to their memory usage when spilling > -- > > Key: SPARK-19244 > URL: https://issues.apache.org/jira/browse/SPARK-19244 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Liang-Chi Hsieh > Fix For: 2.2.0 > > > In `TaskMemoryManager`, when we acquire memory by calling > `acquireExecutionMemory` and cannot get the required amount, we try to > spill other memory consumers. > Currently, we simply iterate over the memory consumers in a hash set, normally > in the same order each time. > The first issue is that we might spill more consumers than necessary. For example, if > consumer 1 uses 10MB, consumer 2 uses 50MB, and consumer 3 then tries to acquire 100MB > when only 60MB is free, spilling is needed. We might spill both consumer > 1 and consumer 2, but spilling consumer 2 alone is enough to satisfy the > required 100MB. > The second issue: suppose we spill consumer 1 the first time spilling occurs. After > a while, consumer 1 uses only 5MB. Then consumer 4 may acquire some memory and > spilling is needed again. Because we iterate over the memory consumers in the same > order, we will spill consumer 1 again, producing > many small spill files for consumer 1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
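The core idea of the fix, choosing spill victims by size instead of by hash-set iteration order, can be sketched in plain Java. This is an illustrative greedy policy, not Spark's actual `TaskMemoryManager` code, and the names (`SpillChooser`, `chooseConsumersToSpill`) are hypothetical: sort consumers by current usage and spill the largest ones until the request is covered, so a small consumer like consumer 1 is left alone.

```java
import java.util.*;

// Illustrative sketch (not Spark's actual TaskMemoryManager): pick which
// memory consumers to spill by sorting them by current usage, freeing the
// requested amount with as few spills as possible.
public class SpillChooser {

    // Returns the consumers to spill, largest first, stopping once the
    // freed memory covers the request.
    static List<String> chooseConsumersToSpill(Map<String, Long> usageByConsumer,
                                               long required) {
        List<Map.Entry<String, Long>> consumers =
            new ArrayList<>(usageByConsumer.entrySet());
        // Largest consumers first: each spill frees the most memory possible.
        consumers.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> toSpill = new ArrayList<>();
        long freed = 0;
        for (Map.Entry<String, Long> c : consumers) {
            if (freed >= required) break;
            toSpill.add(c.getKey());
            freed += c.getValue();
        }
        return toSpill;
    }

    public static void main(String[] args) {
        Map<String, Long> usage = new LinkedHashMap<>();
        usage.put("consumer1", 10L);  // MB
        usage.put("consumer2", 50L);
        // 40MB still needed beyond free memory: only consumer2 is spilled.
        System.out.println(chooseConsumersToSpill(usage, 40L));  // prints [consumer2]
    }
}
```

With the ticket's example (consumer 1 at 10MB, consumer 2 at 50MB, 40MB still needed after free memory is exhausted), only consumer 2 is chosen, matching the behavior the issue asks for.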
[jira] [Commented] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851515#comment-15851515 ] Sean Owen commented on SPARK-19444: --- You're right, I think this may be a copy-and-paste problem. I don't know much about the example include mechanism, but it looks like this is a way to tag code as part of only a certain example. In this case, we don't want that tag, even though it might have been relevant in its source. You can just make all of the imports part of one "example on" block. > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import: > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-16043. -- Resolution: Fixed Fix Version/s: 2.2.0 > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > Fix For: 2.2.0 > > > There is a ToDo in the GenericArrayData class to eliminate > boxing/unboxing for primitive arrays (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]). > It would be good to prepare a GenericArrayData implementation specialized for > primitive arrays to eliminate boxing/unboxing, for the sake of both runtime memory > footprint and performance. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
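The boxing cost the ticket targets can be illustrated with a minimal, self-contained sketch. The classes below are hypothetical stand-ins, not Spark's actual GenericArrayData: an Object[]-backed array unboxes a java.lang.Double on every read, while a specialization backed by double[] reads the primitive directly.

```java
// Illustrative sketch of the idea behind the fix (hypothetical classes, not
// Spark's actual GenericArrayData): a generic array backed by Object[] boxes
// every primitive, while a specialized double[]-backed version does not.
public class PrimitiveArraySketch {

    interface ArrayData {
        double getDouble(int i);
        int numElements();
    }

    // Generic version: every element is a boxed java.lang.Double.
    static final class GenericArray implements ArrayData {
        private final Object[] values;
        GenericArray(Object[] values) { this.values = values; }
        public double getDouble(int i) { return (Double) values[i]; }  // unboxes on each read
        public int numElements() { return values.length; }
    }

    // Specialized version: primitives are stored and read with no boxing at all.
    static final class DoubleArray implements ArrayData {
        private final double[] values;
        DoubleArray(double[] values) { this.values = values; }
        public double getDouble(int i) { return values[i]; }
        public int numElements() { return values.length; }
    }

    static double sum(ArrayData a) {
        double s = 0;
        for (int i = 0; i < a.numElements(); i++) s += a.getDouble(i);
        return s;
    }

    public static void main(String[] args) {
        // Same result either way; the specialized path avoids per-element allocation.
        System.out.println(sum(new DoubleArray(new double[]{1.0, 2.0, 3.0})));  // prints 6.0
    }
}
```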
[jira] [Resolved] (SPARK-16042) Eliminate nullcheck code at projection for an array type
[ https://issues.apache.org/jira/browse/SPARK-16042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16042. --- Resolution: Duplicate Subsumed by another issue, according to the PR > Eliminate nullcheck code at projection for an array type > > > Key: SPARK-16042 > URL: https://issues.apache.org/jira/browse/SPARK-16042 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > When we run a Spark program with a projection over an array type, a null check is > generated at each call that writes an array element. If we know at compilation time that > none of the elements is {{null}}, we can eliminate the null-check code. > {code} > val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") > df.selectExpr("Array(v + 2.2, v + 3.3)").collect > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16094) Support HashAggregateExec for non-partial aggregates
[ https://issues.apache.org/jira/browse/SPARK-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16094. --- Resolution: Won't Fix > Support HashAggregateExec for non-partial aggregates > > > Key: SPARK-16094 > URL: https://issues.apache.org/jira/browse/SPARK-16094 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > Spark currently cannot use `HashAggregateExec` for non-partial aggregates > because `Collect` (`CollectSet`/`CollectList`) has a single shared buffer > inside. Since SortAggregateExec is expensive for larger data, we'd be better off > fixing this. > This ticket is intended to change plans from > {code} > SortAggregate(key=[key#3077], functions=[collect_set(value#3078, 0, 0)], > output=[key#3077,collect_set(value)#3088]) > +- *Sort [key#3077 ASC], false, 0 >+- Exchange hashpartitioning(key#3077, 5) > +- Scan ExistingRDD[key#3077,value#3078] > {code} > into > {code} > HashAggregate(keys=[key#3077], functions=[collect_set(value#3078, 0, 0)], > output=[key#3077, collect_set(value)#3088]) > +- Exchange hashpartitioning(key#3077, 5) >+- Scan ExistingRDD[key#3077,value#3078] > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
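Why a single shared buffer blocks hash aggregation can be sketched outside Spark: a hash-based collect_set must keep one independent buffer per group key, because rows arrive in arbitrary order; a single shared buffer only works when an upstream sort has already grouped the rows (the SortAggregateExec case). The helper below is a hypothetical illustration, not Spark's CollectSet code.

```java
import java.util.*;

// Illustrative sketch (not Spark's CollectSet implementation): hash-based
// collect_set keeps one buffer per group key, since rows for different keys
// can be interleaved in any order.
public class HashCollectSet {

    // For each key, collect the distinct values seen, in an independent
    // per-key buffer (a TreeSet here, for deterministic ordering).
    static Map<String, Set<Integer>> collectSet(String[] keys, int[] values) {
        Map<String, Set<Integer>> buffers = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            // computeIfAbsent allocates a fresh buffer the first time a key appears.
            buffers.computeIfAbsent(keys[i], k -> new TreeSet<>()).add(values[i]);
        }
        return buffers;
    }

    public static void main(String[] args) {
        // Keys arrive interleaved, which a single shared buffer could not handle.
        Map<String, Set<Integer>> result =
            collectSet(new String[]{"a", "b", "a", "a"}, new int[]{1, 2, 1, 3});
        System.out.println(result.get("a"));  // prints [1, 3]
    }
}
```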
[jira] [Resolved] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy`
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16041. --- Resolution: Duplicate This was apparently subsumed by another issue. > Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy` > > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns should not be allowed in `partitionBy`, `bucketBy`, or `sortBy` in > DataFrameWriter. Duplicate columns can cause unpredictable results, for > example resolution failures. > We should detect the duplicates and throw exceptions with appropriate > messages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
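The requested check amounts to detecting duplicate names before writing. The helper below is a hypothetical sketch (the class and method names are invented, not the actual DataFrameWriter code) of how the duplicates could be found and reported with a clear message.

```java
import java.util.*;

// Hypothetical sketch of the check the ticket asks for (not the actual
// DataFrameWriter code): find duplicate column names and fail loudly.
public class ColumnChecks {

    // Returns the duplicated names, in first-seen order, so the caller can
    // name the offenders in the error message.
    static List<String> findDuplicates(List<String> columns) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String c : columns) {
            if (!seen.add(c) && !dups.contains(c)) dups.add(c);
        }
        return dups;
    }

    static void requireNoDuplicates(String operation, List<String> columns) {
        List<String> dups = findDuplicates(columns);
        if (!dups.isEmpty()) {
            throw new IllegalArgumentException(
                "Found duplicate column(s) in " + operation + ": " + dups);
        }
    }

    public static void main(String[] args) {
        try {
            requireNoDuplicates("partitionBy", Arrays.asList("year", "month", "year"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());  // names the duplicated column
        }
    }
}
```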
[jira] [Closed] (SPARK-16200) Rename AggregateFunction#supportsPartial
[ https://issues.apache.org/jira/browse/SPARK-16200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro closed SPARK-16200. Resolution: Won't Fix > Rename AggregateFunction#supportsPartial > > > Key: SPARK-16200 > URL: https://issues.apache.org/jira/browse/SPARK-16200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > We'd also be better off renaming this variable from supportsPartial to > something clearer, because the current name is kind of misleading. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16200) Rename AggregateFunction#supportsPartial
[ https://issues.apache.org/jira/browse/SPARK-16200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851501#comment-15851501 ] Takeshi Yamamuro edited comment on SPARK-16200 at 2/3/17 1:54 PM: -- okay, thanks for letting me know! It's okay to set "Won't Fix". was (Author: maropu): okay, thanks for letting me know! It's okay to set "Resolved". > Rename AggregateFunction#supportsPartial > > > Key: SPARK-16200 > URL: https://issues.apache.org/jira/browse/SPARK-16200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > We'd also be better off renaming the `supportsPartial` variable, because the > current name is kind of misleading.
[jira] [Commented] (SPARK-16200) Rename AggregateFunction#supportsPartial
[ https://issues.apache.org/jira/browse/SPARK-16200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851501#comment-15851501 ] Takeshi Yamamuro commented on SPARK-16200: -- okay, thanks for letting me know! It's okay to set "Resolved". > Rename AggregateFunction#supportsPartial > > > Key: SPARK-16200 > URL: https://issues.apache.org/jira/browse/SPARK-16200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > We'd also be better off renaming the `supportsPartial` variable, because the > current name is kind of misleading.
[jira] [Commented] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851500#comment-15851500 ] Takeshi Yamamuro commented on SPARK-16043: -- I think the issue this ticket describes has been mostly resolved by SPARK-14850. How about setting this to `Resolved`? Then, if any related issues remain, it'd be better to open new JIRAs for them. Thoughts? > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > There is a TODO in the GenericArrayData class to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]). > It would be good to prepare a GenericArrayData implementation specialized for > primitive arrays to eliminate boxing/unboxing, for both runtime memory > footprint and performance.
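The footprint cost of boxing that SPARK-16043 targets can be illustrated with a small Python analogy: a container specialized for a primitive type stores raw values contiguously, while a generic container stores a pointer to a boxed object per element. This is only a sketch of the trade-off the ticket describes, not JVM or Spark code:

```python
# Sketch: generic (boxed) storage vs specialized (primitive) storage.
# Python's array module stores raw doubles contiguously; a list stores
# one heap-allocated float object per element, analogous to JVM boxing.
import sys
from array import array

n = 1000
boxed = [float(i) for i in range(n)]   # generic: one boxed object per element
primitive = array("d", range(n))       # specialized: contiguous raw doubles

boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
primitive_bytes = sys.getsizeof(primitive)
print(boxed_bytes > primitive_bytes)  # the specialized layout is far smaller
```

Beyond memory, the specialized layout also avoids the per-element allocation and pointer chasing that boxing implies, which is the performance half of the ticket's motivation.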
[jira] [Closed] (SPARK-15180) Support subexpression elimination in Fliter
[ https://issues.apache.org/jira/browse/SPARK-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-15180. --- Resolution: Won't Fix > Support subexpression elimination in Fliter > --- > > Key: SPARK-15180 > URL: https://issues.apache.org/jira/browse/SPARK-15180 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > Wholestage filter doesn't support subexpression elimination now. We should > add this support.
[jira] [Commented] (SPARK-15180) Support subexpression elimination in Fliter
[ https://issues.apache.org/jira/browse/SPARK-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851491#comment-15851491 ] Liang-Chi Hsieh commented on SPARK-15180: - [~hyukjin.kwon] Yes. I resolved this. Thanks! > Support subexpression elimination in Fliter > --- > > Key: SPARK-15180 > URL: https://issues.apache.org/jira/browse/SPARK-15180 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > Wholestage filter doesn't support subexpression elimination now. We should > add this support.
[jira] [Commented] (SPARK-15911) Remove additional Project to be consistent with SQL when insert into table
[ https://issues.apache.org/jira/browse/SPARK-15911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851488#comment-15851488 ] Liang-Chi Hsieh commented on SPARK-15911: - [~hyukjin.kwon] Thanks! > Remove additional Project to be consistent with SQL when insert into table > -- > > Key: SPARK-15911 > URL: https://issues.apache.org/jira/browse/SPARK-15911 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > Currently, in DataFrameWriter's insertInto and the Analyzer's > ResolveRelations, we add an additional Project to adjust column ordering. > However, this resolution should use ordering, not names. We should fix it.
[jira] [Resolved] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
[ https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-17161. - Resolution: Fixed Fix Version/s: 2.2.0 > Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays > - > > Key: SPARK-17161 > URL: https://issues.apache.org/jira/browse/SPARK-17161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Minor > Fix For: 2.2.0 > > > Often in Spark ML, there are classes that use a Scala Array in a constructor. > In order to add the same API to Python, a Java-friendly alternate > constructor needs to exist to be compatible with py4j when converting from a > list. This is because the current conversion in PySpark _py2java creates a > java.util.ArrayList, as shown in this error msg > {noformat} > Py4JError: An error occurred while calling > None.org.apache.spark.ml.feature.CountVectorizerModel. Trace: > py4j.Py4JException: Constructor > org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) > does not exist > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179) > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196) > at py4j.Gateway.invoke(Gateway.java:235) > {noformat} > Creating an alternate constructor can be avoided by creating a py4j JavaArray > using {{new_array}}. This type is compatible with the Scala Array currently > used in classes like {{CountVectorizerModel}} and {{StringIndexerModel}}. > Most of the boiler-plate Python code to do this can be put in a convenience > function inside of ml.JavaWrapper to give a clean way of constructing ML > objects without adding special constructors.
[jira] [Commented] (SPARK-16200) Rename AggregateFunction#supportsPartial
[ https://issues.apache.org/jira/browse/SPARK-16200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851474#comment-15851474 ] Hyukjin Kwon commented on SPARK-16200: -- (it might be good to double-check this one too, per https://github.com/apache/spark/pull/13852#issuecomment-242347430) > Rename AggregateFunction#supportsPartial > > > Key: SPARK-16200 > URL: https://issues.apache.org/jira/browse/SPARK-16200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > We'd also be better off renaming the `supportsPartial` variable, because the > current name is kind of misleading.
[jira] [Created] (SPARK-19448) unify some duplication function in MetaStoreRelation
Song Jun created SPARK-19448: Summary: unify some duplication function in MetaStoreRelation Key: SPARK-19448 URL: https://issues.apache.org/jira/browse/SPARK-19448 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor 1. MetaStoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's toHiveTable 2. MetaStoreRelation's toHiveColumn can be replaced by calling HiveClientImpl's toHiveColumn 3. address another TODO: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234
[jira] [Commented] (SPARK-16094) Support HashAggregateExec for non-partial aggregates
[ https://issues.apache.org/jira/browse/SPARK-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851473#comment-15851473 ] Hyukjin Kwon commented on SPARK-16094: -- [~maropu], I just happened to see this JIRA. Might this JIRA be resolvable per https://github.com/apache/spark/pull/13802#issuecomment-243756620? > Support HashAggregateExec for non-partial aggregates > > > Key: SPARK-16094 > URL: https://issues.apache.org/jira/browse/SPARK-16094 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > Spark currently cannot use `HashAggregateExec` for non-partial aggregates > because `Collect` (`CollectSet`/`CollectList`) has a single shared buffer > inside. Since SortAggregateExec is expensive for bigger data, we'd be better > off fixing this. > This ticket is intended to change plans from > {code} > SortAggregate(key=[key#3077], functions=[collect_set(value#3078, 0, 0)], > output=[key#3077,collect_set(value)#3088]) > +- *Sort [key#3077 ASC], false, 0 >+- Exchange hashpartitioning(key#3077, 5) > +- Scan ExistingRDD[key#3077,value#3078] > {code} > into > {code} > HashAggregate(keys=[key#3077], functions=[collect_set(value#3078, 0, 0)], > output=[key#3077, collect_set(value)#3088]) > +- Exchange hashpartitioning(key#3077, 5) >+- Scan ExistingRDD[key#3077,value#3078] > {code}
[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
[ https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851472#comment-15851472 ] Kazuaki Ishizaki commented on SPARK-19372: -- I was able to reproduce this. I am thinking about how to reduce the bytecode size per Java method. > Code generation for Filter predicate including many OR conditions exceeds JVM > method size limit > > > Key: SPARK-19372 > URL: https://issues.apache.org/jira/browse/SPARK-19372 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi > Attachments: wide400cols.csv > > > For the attached csv file, the code below causes the exception > "org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" > grows beyond 64 KB" > Code: > {code:borderStyle=solid} > val conf = new SparkConf().setMaster("local[1]") > val sqlContext = > SparkSession.builder().config(conf).getOrCreate().sqlContext > val dataframe = > sqlContext > .read > .format("com.databricks.spark.csv") > .load("wide400cols.csv") > val filter = (0 to 399) > .foldLeft(lit(false))((e, index) => > e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}")) > val filtered = dataframe.filter(filter) > filtered.show(100) > {code}
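One common way around a per-method code-size limit like the one hit in SPARK-19372 is to split a wide OR over many predicates into smaller groups, each of which could be compiled into its own (small) method. The sketch below shows the grouping idea in Python purely as an illustration of the approach under discussion; it is not the actual Spark fix, and all names are hypothetical:

```python
def build_chunked_or(predicates, chunk_size=50):
    """Compose many per-row predicates into one, grouped so that each
    group could be emitted as a separate small generated method."""
    chunks = [predicates[i:i + chunk_size]
              for i in range(0, len(predicates), chunk_size)]
    # Each group function ORs only its own slice of the predicates.
    groups = [lambda row, c=chunk: any(p(row) for p in c) for chunk in chunks]
    return lambda row: any(g(row) for g in groups)

# 400 predicates of the shape col_i != "column{i+1}", mirroring the report.
preds = [lambda row, i=i: row[i] != f"column{i + 1}" for i in range(400)]
matches = build_chunked_or(preds)

row_all_defaults = [f"column{i + 1}" for i in range(400)]
print(matches(row_all_defaults))  # False: every predicate fails on this row
```

The composed predicate is semantically identical to the flat 400-way OR; only the shape of the generated code changes, which is what matters for the 64 KB JVM method limit.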
[jira] [Commented] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851471#comment-15851471 ] Hyukjin Kwon commented on SPARK-16043: -- (maybe it would be great if this one is checked too, per https://github.com/apache/spark/pull/13758#issuecomment-269589087) > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > There is a TODO in the GenericArrayData class to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]). > It would be good to prepare a GenericArrayData implementation specialized for > primitive arrays to eliminate boxing/unboxing, for both runtime memory > footprint and performance.
[jira] [Commented] (SPARK-16042) Eliminate nullcheck code at projection for an array type
[ https://issues.apache.org/jira/browse/SPARK-16042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851469#comment-15851469 ] Hyukjin Kwon commented on SPARK-16042: -- [~kiszk], would this JIRA maybe be resolvable per https://github.com/apache/spark/pull/13757#issuecomment-270453328? > Eliminate nullcheck code at projection for an array type > > > Key: SPARK-16042 > URL: https://issues.apache.org/jira/browse/SPARK-16042 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > When we run a Spark program with a projection for an array type, a null check > is generated at each call that writes an array element. If we know at > compilation time that none of the elements can be {{null}}, we can eliminate > the null-check code. > {code} > val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") > df.selectExpr("Array(v + 2.2, v + 3.3)").collect > {code}
[jira] [Commented] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy`
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851468#comment-15851468 ] Hyukjin Kwon commented on SPARK-16041: -- [~smilegator], I just happened to see this JIRA while looking through the history. Per https://github.com/apache/spark/pull/13756#issuecomment-237658676, would this JIRA maybe be resolvable? > Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy` > > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `bucketBy`, or `sortBy` in > DataFrameWriter. Duplicate columns can cause unpredictable results, for > example resolution failures. > We should detect the duplicates and issue exceptions with appropriate > messages.
[jira] [Resolved] (SPARK-15911) Remove additional Project to be consistent with SQL when insert into table
[ https://issues.apache.org/jira/browse/SPARK-15911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15911. -- Resolution: Duplicate I am resolving this per https://github.com/apache/spark/pull/13631#issuecomment-227087585 > Remove additional Project to be consistent with SQL when insert into table > -- > > Key: SPARK-15911 > URL: https://issues.apache.org/jira/browse/SPARK-15911 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > Currently, in DataFrameWriter's insertInto and the Analyzer's > ResolveRelations, we add an additional Project to adjust column ordering. > However, this resolution should use ordering, not names. We should fix it.