[jira] [Updated] (SPARK-8177) date/time function: year
[ https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8177: --- Description: {code} year(timestamp time): int {code} Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970. was: year(string|date|timestamp): int Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF > date/time function: year > > > Key: SPARK-8177 > URL: https://issues.apache.org/jira/browse/SPARK-8177 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > {code} > year(timestamp time): int > {code} > Returns the year part of a date or a timestamp string: year("1970-01-01 > 00:00:00") = 1970, year("1970-01-01") = 1970. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
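A minimal usage sketch of the proposed function, assuming it becomes callable from Spark SQL under the name year once implemented (the registration details are an assumption, not part of this ticket):
{code}
// Sketch only: exercising the proposed year() expression through Spark SQL.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)  // sc: an existing SparkContext
sqlContext.sql("SELECT year('1970-01-01 00:00:00')").show()  // expected: 1970
sqlContext.sql("SELECT year('1970-01-01')").show()           // expected: 1970
{code}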
[jira] [Updated] (SPARK-8176) date/time function: to_date
[ https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8176: --- Description: parse a timestamp string and return the date portion {code} to_date(string timestamp): date {code} Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" (in some date format) See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: # parse a timestamp string and return the date portion to_date(string timestamp): date Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" (in some date format) See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF > date/time function: to_date > --- > > Key: SPARK-8176 > URL: https://issues.apache.org/jira/browse/SPARK-8176 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > parse a timestamp string and return the date portion > {code} > to_date(string timestamp): date > {code} > Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = > "1970-01-01" (in some date format) > See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8176) date/time function: to_date
[ https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8176: --- Description: # parse a timestamp string and return the date portion to_date(string timestamp): date Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" (in some date format) See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: to_date(date|timestamp): date to_date(string): string Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01". See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF > date/time function: to_date > --- > > Key: SPARK-8176 > URL: https://issues.apache.org/jira/browse/SPARK-8176 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > # parse a timestamp string and return the date portion > to_date(string timestamp): date > Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = > "1970-01-01" (in some date format) > See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8176) date/time function: to_date
[ https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8176: --- Description: parse a timestamp string and return the date portion {code} to_date(string timestamp): date {code} Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" (in some date format) was: parse a timestamp string and return the date portion {code} to_date(string timestamp): date {code} Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" (in some date format) See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF > date/time function: to_date > --- > > Key: SPARK-8176 > URL: https://issues.apache.org/jira/browse/SPARK-8176 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > parse a timestamp string and return the date portion > {code} > to_date(string timestamp): date > {code} > Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = > "1970-01-01" (in some date format) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
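For reference, a hedged sketch of the behaviour described above, assuming the function is exposed to Spark SQL as to_date once implemented:
{code}
// Sketch only: to_date() should keep just the date portion of a timestamp string.
sqlContext.sql("SELECT to_date('1970-01-01 00:00:00')").show()  // expected: 1970-01-01
{code}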
[jira] [Updated] (SPARK-8753) Create an IntervalType data type
[ https://issues.apache.org/jira/browse/SPARK-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8753: --- Issue Type: Sub-task (was: New Feature) Parent: SPARK-8159 > Create an IntervalType data type > > > Key: SPARK-8753 > URL: https://issues.apache.org/jira/browse/SPARK-8753 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > We should create an IntervalType data type that represents time intervals. > Internally, we can use a long value to store it, similar to Timestamp (i.e. > 100ns precision). This data type initially cannot be stored externally, but > only used for expressions. > 1. Add IntervalType data type. > 2. Add parser support in our SQL expression, in the form of > {code} > INTERVAL [number] [unit] > {code} > unit can be YEAR[S], MONTH[S], WEEK[S], DAY[S], HOUR[S], MINUTE[S], > SECOND[S], MILLISECOND[S], MICROSECOND[S], or NANOSECOND[S]. > 3. Add in the analyzer to make sure we throw some exception to prevent saving > a dataframe/table with IntervalType out to external systems. > Related Hive ticket: https://issues.apache.org/jira/browse/HIVE-9792 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
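A sketch of what the proposed parser support is meant to accept; how the resulting interval combines with other expressions is still an assumption at this stage:
{code}
// Sketch only: the proposed INTERVAL literal syntax, written through the SQL parser.
sqlContext.sql("SELECT INTERVAL 3 DAYS")      // a 3-day interval literal
sqlContext.sql("SELECT INTERVAL 30 SECONDS")  // units may be singular or plural per the ticket
{code}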
[jira] [Assigned] (SPARK-8810) Gaps in SQL UDF test coverage
[ https://issues.apache.org/jira/browse/SPARK-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8810: --- Assignee: (was: Apache Spark) > Gaps in SQL UDF test coverage > - > > Key: SPARK-8810 > URL: https://issues.apache.org/jira/browse/SPARK-8810 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 1.2.0 > Environment: all >Reporter: Spiro Michaylov > Labels: test > > SQL UDFs are untested in GROUP BY, WHERE and HAVING clauses, and in > combination. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8810) Gaps in SQL UDF test coverage
[ https://issues.apache.org/jira/browse/SPARK-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612871#comment-14612871 ] Apache Spark commented on SPARK-8810: - User 'spirom' has created a pull request for this issue: https://github.com/apache/spark/pull/7207 > Gaps in SQL UDF test coverage > - > > Key: SPARK-8810 > URL: https://issues.apache.org/jira/browse/SPARK-8810 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 1.2.0 > Environment: all >Reporter: Spiro Michaylov > Labels: test > > SQL UDFs are untested in GROUP BY, WHERE and HAVING clauses, and in > combination. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8810) Gaps in SQL UDF test coverage
[ https://issues.apache.org/jira/browse/SPARK-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8810: --- Assignee: Apache Spark > Gaps in SQL UDF test coverage > - > > Key: SPARK-8810 > URL: https://issues.apache.org/jira/browse/SPARK-8810 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 1.2.0 > Environment: all >Reporter: Spiro Michaylov >Assignee: Apache Spark > Labels: test > > SQL UDFs are untested in GROUP BY, WHERE and HAVING clauses, and in > combination. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8810) Gaps in SQL UDF test coverage
Spiro Michaylov created SPARK-8810: -- Summary: Gaps in SQL UDF test coverage Key: SPARK-8810 URL: https://issues.apache.org/jira/browse/SPARK-8810 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.2.0 Environment: all Reporter: Spiro Michaylov SQL UDFs are untested in GROUP BY, WHERE and HAVING clauses, and in combination. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
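For context, a sketch of the currently untested usage patterns; the UDF name and table are hypothetical:
{code}
// Sketch only: a registered SQL UDF used in WHERE, GROUP BY and HAVING clauses.
sqlContext.udf.register("inc", (x: Int) => x + 1)
sqlContext.sql("""
  SELECT inc(key), count(*)
  FROM t
  WHERE inc(key) > 2
  GROUP BY inc(key)
  HAVING count(*) > 1
""")
{code}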
[jira] [Assigned] (SPARK-8809) Remove ConvertNaNs analyzer rule
[ https://issues.apache.org/jira/browse/SPARK-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8809: --- Assignee: Reynold Xin (was: Apache Spark) > Remove ConvertNaNs analyzer rule > > > Key: SPARK-8809 > URL: https://issues.apache.org/jira/browse/SPARK-8809 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Cast already handles "NaN" when casting from string to double/float. I don't > think this rule is necessary anymore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8809) Remove ConvertNaNs analyzer rule
[ https://issues.apache.org/jira/browse/SPARK-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8809: --- Assignee: Apache Spark (was: Reynold Xin) > Remove ConvertNaNs analyzer rule > > > Key: SPARK-8809 > URL: https://issues.apache.org/jira/browse/SPARK-8809 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Cast already handles "NaN" when casting from string to double/float. I don't > think this rule is necessary anymore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8809) Remove ConvertNaNs analyzer rule
[ https://issues.apache.org/jira/browse/SPARK-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612856#comment-14612856 ] Apache Spark commented on SPARK-8809: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7206 > Remove ConvertNaNs analyzer rule > > > Key: SPARK-8809 > URL: https://issues.apache.org/jira/browse/SPARK-8809 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Cast already handles "NaN" when casting from string to double/float. I don't > think this rule is necessary anymore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
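A quick sketch of the Cast behaviour the ticket relies on (expected results noted in the comment; this is an illustration, not a test from the pull request):
{code}
// Sketch only: casting the string "NaN" should already yield NaN without the ConvertNaNs rule.
sqlContext.sql("SELECT cast('NaN' as double), cast('NaN' as float)").show()  // expected: NaN, NaN
{code}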
[jira] [Comment Edited] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612847#comment-14612847 ] Feynman Liang edited comment on SPARK-5016 at 7/3/15 5:20 AM: -- I did some [perf testing | https://gist.github.com/feynmanliang/70d79c23dffc828939ec] and it shows that distributing the Gaussians does yield a significant improvement in performance when the number of clusters and dimensionality of the data is sufficiently large (>30 dimensions, >10 clusters). In particular, the "typical" use case of 40 dimensions and 10k clusters gains about 15 seconds in runtime when distributing the Gaussians. was (Author: fliang): I did some [perf testing](https://gist.github.com/feynmanliang/70d79c23dffc828939ec) and it shows that distributing the Gaussians does yield a significant improvement in performance when the number of clusters and dimensionality of the data is sufficiently large (>30 dimensions, >10 clusters). In particular, the "typical" use case of 40 dimensions and 10k clusters gains about 15 seconds in runtime when distributing the Gaussians. > GaussianMixtureEM should distribute matrix inverse for large numFeatures, k > --- > > Key: SPARK-5016 > URL: https://issues.apache.org/jira/browse/SPARK-5016 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > Labels: clustering > > If numFeatures or k are large, GMM EM should distribute the matrix inverse > computation for Gaussian initialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612847#comment-14612847 ] Feynman Liang commented on SPARK-5016: -- I did some [perf testing](https://gist.github.com/feynmanliang/70d79c23dffc828939ec) and it shows that distributing the Gaussians does yield a significant improvement in performance when the number of clusters and dimensionality of the data is sufficiently large (>30 dimensions, >10 clusters). In particular, the "typical" use case of 40 dimensions and 10k clusters gains about 15 seconds in runtime when distributing the Gaussians. > GaussianMixtureEM should distribute matrix inverse for large numFeatures, k > --- > > Key: SPARK-5016 > URL: https://issues.apache.org/jira/browse/SPARK-5016 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > Labels: clustering > > If numFeatures or k are large, GMM EM should distribute the matrix inverse > computation for Gaussian initialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8803) Crosstab elements can't contain nulls and back ticks
[ https://issues.apache.org/jira/browse/SPARK-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8803. Resolution: Fixed Assignee: Burak Yavuz Fix Version/s: 1.5.0 1.4.1 > Crosstab elements can't contain nulls and back ticks > -- > > Key: SPARK-8803 > URL: https://issues.apache.org/jira/browse/SPARK-8803 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 1.4.1, 1.5.0 > > > Having back ticks or nulls as elements causes problems. > Since elements become column names, we have to strip back ticks from the > elements because they are special characters. > Nulls throw exceptions; we could replace them with empty strings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
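A sketch of the failure mode being fixed; the DataFrame and its values are hypothetical:
{code}
// Sketch only: crosstab turns the values of the second column into column names,
// so backticks and nulls inside those values are problematic before this fix.
val df = sqlContext.createDataFrame(Seq(
  ("a", "x`y"),  // backtick inside a value that will become a column name
  ("b", null)    // null value that will become a column name
)).toDF("k", "v")
df.stat.crosstab("k", "v").show()
{code}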
[jira] [Assigned] (SPARK-8776) Increase the default MaxPermSize
[ https://issues.apache.org/jira/browse/SPARK-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-8776: --- Assignee: Yin Huai > Increase the default MaxPermSize > > > Key: SPARK-8776 > URL: https://issues.apache.org/jira/browse/SPARK-8776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.4.1, 1.5.0 > > > Since 1.4.0, Spark SQL has isolated class loaders for separating Hive > dependencies on metastore and execution, which increases the memory > consumption of PermGen. How about we increase the default size from 128m to > 256m? It seems the change we need to make is > https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java#L139. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8776) Increase the default MaxPermSize
[ https://issues.apache.org/jira/browse/SPARK-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-8776. - Resolution: Fixed Fix Version/s: 1.5.0 1.4.1 Issue resolved by pull request 7196 [https://github.com/apache/spark/pull/7196] > Increase the default MaxPermSize > > > Key: SPARK-8776 > URL: https://issues.apache.org/jira/browse/SPARK-8776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Yin Huai > Fix For: 1.4.1, 1.5.0 > > > Since 1.4.0, Spark SQL has isolated class loaders for separating Hive > dependencies on metastore and execution, which increases the memory > consumption of PermGen. How about we increase the default size from 128m to > 256m? It seems the change we need to make is > https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java#L139. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
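Until the new default lands, a hedged workaround sketch; the conf keys are standard Spark options and 256m mirrors the proposed value:
{code}
// Sketch only: raising PermGen explicitly instead of relying on the launcher default.
// The driver variant must be supplied before the driver JVM starts, e.g. in spark-defaults.conf
// or via spark-submit --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=256m")  // executors pick this up at launch
{code}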
[jira] [Created] (SPARK-8809) Remove ConvertNaNs analyzer rule
Reynold Xin created SPARK-8809: -- Summary: Remove ConvertNaNs analyzer rule Key: SPARK-8809 URL: https://issues.apache.org/jira/browse/SPARK-8809 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Cast already handles "NaN" when casting from string to double/float. I don't think this rule is necessary anymore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8801) Support TypeCollection in ExpectsInputTypes
[ https://issues.apache.org/jira/browse/SPARK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8801. Resolution: Fixed Fix Version/s: 1.5.0 > Support TypeCollection in ExpectsInputTypes > --- > > Key: SPARK-8801 > URL: https://issues.apache.org/jira/browse/SPARK-8801 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0 > > > Some functions support more than one input types for each parameter. For > example, length supports binary and string, and maybe array/struct in the > future. > This ticket proposes a TypeCollection AbstractDataType that supports multiple > data types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
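An illustrative sketch of the idea only; the names and the accepts method below are assumptions, not Spark's actual catalyst API:
{code}
// Sketch only: an "abstract" expected-input type that accepts any one of several concrete types.
sealed trait AbstractType
case object StringT extends AbstractType
case object BinaryT extends AbstractType
case class TypeCollection(alternatives: Set[AbstractType]) extends AbstractType {
  def accepts(t: AbstractType): Boolean = alternatives.contains(t)
}

// e.g. length(...) could declare that its argument may be either a string or a binary value:
val lengthInput = TypeCollection(Set(StringT, BinaryT))
assert(lengthInput.accepts(StringT) && lengthInput.accepts(BinaryT))
{code}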
[jira] [Resolved] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery
[ https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8501. --- Resolution: Fixed Fixed by https://github.com/apache/spark/pull/7199 Backported to 1.4.1 by https://github.com/apache/spark/pull/7200 > ORC data source may give empty schema if an ORC file containing zero rows is > picked for schema discovery > > > Key: SPARK-8501 > URL: https://issues.apache.org/jira/browse/SPARK-8501 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Hive 0.13.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > > Not sure whether this should be considered as a bug of ORC bundled with Hive > 0.13.1: for an ORC file containing zero rows, the schema written in its > footer contains zero fields (e.g. {{struct<>}}). > To reproduce this issue, let's first produce an empty ORC file. Copy data > file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to > {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the > following lines in Hive 0.13.1 CLI: > {noformat} > $ hive > hive> CREATE TABLE foo(key INT, value STRING); > hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo; > hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1; > {noformat} > Now inspect the empty ORC file we just wrote: > {noformat} > $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0 > Structure for /user/hive/warehouse_hive13/bar/00_0 > 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from > /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, > length: 9223372036854775807} > Rows: 0 > Compression: ZLIB > Compression size: 262144 > Type: struct<> > Stripe Statistics: > File Statistics: > Column 0: count: 0 > Stripes: > {noformat} > Notice the {{struct<>}} part. > This "feature" is OK for Hive, which has a central metastore to save table > schema. But for users who read raw data files without Hive metastore with > Spark SQL 1.4.0, it causes problem because currently the ORC data source just > picks a random part-file whichever comes the first for schema discovery. > Expected behavior can be: > # Try all files one by one until we find a part-file with non-empty schema. > # Throws {{AnalysisException}} if no such part-file can be found. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8804) order of UTF8String is wrong if there is any non-ascii character in it
[ https://issues.apache.org/jira/browse/SPARK-8804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8804: --- Fix Version/s: 1.4.1 > order of UTF8String is wrong if there is any non-ascii character in it > --- > > Key: SPARK-8804 > URL: https://issues.apache.org/jira/browse/SPARK-8804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.4.1 > > > We compare the UTF8String byte by byte, but byte in JVM is signed, it should > be compared as unsigned. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
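A self-contained sketch of the fix idea (not the actual patch): mask each byte to its unsigned value before comparing.
{code}
// Sketch only: lexicographic comparison treating bytes as unsigned (0..255) rather than signed.
def compareUnsigned(a: Array[Byte], b: Array[Byte]): Int = {
  val len = math.min(a.length, b.length)
  var i = 0
  while (i < len) {
    val cmp = (a(i) & 0xff) - (b(i) & 0xff)  // & 0xff promotes the byte to its unsigned int value
    if (cmp != 0) return cmp
    i += 1
  }
  a.length - b.length
}

// Non-ASCII bytes are negative as signed JVM bytes, so a signed comparison orders them wrongly:
compareUnsigned("é".getBytes("UTF-8"), "z".getBytes("UTF-8"))  // > 0, the correct UTF-8 order
{code}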
[jira] [Resolved] (SPARK-8213) math function: factorial
[ https://issues.apache.org/jira/browse/SPARK-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8213. Resolution: Fixed Fix Version/s: 1.5.0 (still missing Python) > math function: factorial > > > Key: SPARK-8213 > URL: https://issues.apache.org/jira/browse/SPARK-8213 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: zhichao-li > Fix For: 1.5.0 > > > factorial(INT a): long > Returns the factorial of a (as of Hive 1.2.0). Valid a is [0..20]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
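A usage sketch, assuming the function is callable from SQL as factorial with the Hive signature:
{code}
// Sketch only: factorial of 5.
sqlContext.sql("SELECT factorial(5)").show()  // expected: 120
{code}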
[jira] [Commented] (SPARK-8159) Improve SQL/DataFrame expression coverage
[ https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612799#comment-14612799 ] Reynold Xin commented on SPARK-8159: I think it is ok to add them all at once. But it is also ok if there are pull requests that add a few of them at a time. Not a big deal. > Improve SQL/DataFrame expression coverage > - > > Key: SPARK-8159 > URL: https://issues.apache.org/jira/browse/SPARK-8159 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > This is an umbrella ticket to track new expressions we are adding to > SQL/DataFrame. > For each new expression, we should: > 1. Add a new Expression implementation in > org.apache.spark.sql.catalyst.expressions > 2. If applicable, implement the code generated version (by implementing > genCode). > 3. Add comprehensive unit tests (for all the data types the expressions > support). > 4. If applicable, add a new function for DataFrame in > org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for > Python. > For date/time functions, put them in expressions/datetime.scala, and create a > DateTimeFunctionSuite.scala for testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SuYan updated SPARK-5594: - Comment: was deleted (was: Do you write sth like: object XXX { val sc = new SparkContext() def main { rdd.map { someFunc() } } def someFunc{} } Our user meet the same exception because make the sparkContext as static variable instead of a local variable.) > SparkException: Failed to get broadcast (TorrentBroadcast) > -- > > Key: SPARK-5594 > URL: https://issues.apache.org/jira/browse/SPARK-5594 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: John Sandiford >Priority: Critical > > I am uncertain whether this is a bug, however I am getting the error below > when running on a cluster (works locally), and have no idea what is causing > it, or where to look for more information. > Any help is appreciated. Others appear to experience the same issue, but I > have not found any solutions online. > Please note that this only happens with certain code and is repeatable, all > my other spark jobs work fine. > {noformat} > ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: > Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: > org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of > broadcast_6 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 > of broadcast_6 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) > 
at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008) > ... 11 more > {noformat} > Driver stacktrace: > {noformat} > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) > at > org.apache.spa
[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612786#comment-14612786 ] SuYan commented on SPARK-5594: -- Do you write sth like: object XXX { val sc = new SparkContext() def main { rdd.map { someFunc() } } def someFunc{} } Our user meet the same exception because make the sparkContext as static variable instead of a local variable. > SparkException: Failed to get broadcast (TorrentBroadcast) > -- > > Key: SPARK-5594 > URL: https://issues.apache.org/jira/browse/SPARK-5594 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: John Sandiford >Priority: Critical > > I am uncertain whether this is a bug, however I am getting the error below > when running on a cluster (works locally), and have no idea what is causing > it, or where to look for more information. > Any help is appreciated. Others appear to experience the same issue, but I > have not found any solutions online. > Please note that this only happens with certain code and is repeatable, all > my other spark jobs work fine. > {noformat} > ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: > Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: > org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of > broadcast_6 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 > of broadcast_6 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at scala.collection.immutable.List.foreach(List.scala:318) > at > 
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008) > ... 11 more > {noformat} > Driver stacktrace: > {noformat} > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) >
[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612785#comment-14612785 ] SuYan commented on SPARK-5594: -- Do you write sth like: object XXX { val sc = new SparkContext() def main { rdd.map { someFunc() } } def someFunc{} } Our user meet the same exception because make the sparkContext as static variable instead of a local variable. > SparkException: Failed to get broadcast (TorrentBroadcast) > -- > > Key: SPARK-5594 > URL: https://issues.apache.org/jira/browse/SPARK-5594 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: John Sandiford >Priority: Critical > > I am uncertain whether this is a bug, however I am getting the error below > when running on a cluster (works locally), and have no idea what is causing > it, or where to look for more information. > Any help is appreciated. Others appear to experience the same issue, but I > have not found any solutions online. > Please note that this only happens with certain code and is repeatable, all > my other spark jobs work fine. > {noformat} > ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: > Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: > org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of > broadcast_6 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 > of broadcast_6 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at scala.collection.immutable.List.foreach(List.scala:318) > at > 
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008) > ... 11 more > {noformat} > Driver stacktrace: > {noformat} > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) >
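A hedged sketch of the pattern the comment above warns about versus the suggested alternative; the application code is hypothetical, not taken from this ticket:
{code}
// Sketch only: when the SparkContext lives in a field of the enclosing singleton object,
// task closures that reference the object's members can cause the object (and its
// SparkContext initializer) to be initialized on executors as well.
import org.apache.spark.{SparkConf, SparkContext}

object ProblematicApp {
  val sc = new SparkContext(new SparkConf().setAppName("problematic").setMaster("local[2]"))
  def main(args: Array[String]): Unit = {
    sc.parallelize(1 to 10).map(x => someFunc(x)).collect()  // closure references ProblematicApp
  }
  def someFunc(x: Int): Int = x + 1
}

object SaferApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("safer").setMaster("local[2]"))  // local val
    sc.parallelize(1 to 10).map(_ + 1).collect()
    sc.stop()
  }
}
{code}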
[jira] [Updated] (SPARK-6980) Akka timeout exceptions indicate which conf controls them
[ https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-6980: Fix Version/s: (was: 1.6.0) 1.5.0 > Akka timeout exceptions indicate which conf controls them > - > > Key: SPARK-6980 > URL: https://issues.apache.org/jira/browse/SPARK-6980 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Imran Rashid >Assignee: Bryan Cutler >Priority: Minor > Labels: starter > Fix For: 1.5.0 > > Attachments: Spark-6980-Test.scala > > > If you hit one of the akka timeouts, you just get an exception like > {code} > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > {code} > The exception doesn't indicate how to change the timeout, though there is > usually (always?) a corresponding setting in {{SparkConf}}. It would be > nice if the exception included the relevant setting. > I think this should be pretty easy to do -- we just need to create something > like a {{NamedTimeout}}. It would have its own {{await}} method that catches the > akka timeout and throws its own exception. We should change > {{RpcUtils.askTimeout}} and {{RpcUtils.lookupTimeout}} to always give a > {{NamedTimeout}}, so we can be sure that anytime we have a timeout, we get a > better exception. > Given the latest refactoring to the rpc layer, this needs to be done in both > {{AkkaUtils}} and {{AkkaRpcEndpoint}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6980) Akka timeout exceptions indicate which conf controls them
[ https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-6980: Assignee: Bryan Cutler (was: Harsh Gupta) > Akka timeout exceptions indicate which conf controls them > - > > Key: SPARK-6980 > URL: https://issues.apache.org/jira/browse/SPARK-6980 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Imran Rashid >Assignee: Bryan Cutler >Priority: Minor > Labels: starter > Fix For: 1.5.0 > > Attachments: Spark-6980-Test.scala > > > If you hit one of the akka timeouts, you just get an exception like > {code} > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > {code} > The exception doesn't indicate how to change the timeout, though there is > usually (always?) a corresponding setting in {{SparkConf}}. It would be > nice if the exception included the relevant setting. > I think this should be pretty easy to do -- we just need to create something > like a {{NamedTimeout}}. It would have its own {{await}} method that catches the > akka timeout and throws its own exception. We should change > {{RpcUtils.askTimeout}} and {{RpcUtils.lookupTimeout}} to always give a > {{NamedTimeout}}, so we can be sure that anytime we have a timeout, we get a > better exception. > Given the latest refactoring to the rpc layer, this needs to be done in both > {{AkkaUtils}} and {{AkkaRpcEndpoint}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8808) Fix assignments in SparkR
Yu Ishikawa created SPARK-8808: -- Summary: Fix assignments in SparkR Key: SPARK-8808 URL: https://issues.apache.org/jira/browse/SPARK-8808 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa {noformat} inst/tests/test_binary_function.R:79:12: style: Use <-, not =, for assignment. mockFile = c("Spark is pretty.", "Spark is awesome.") {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6980) Akka timeout exceptions indicate which conf controls them
[ https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-6980. - Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 6205 [https://github.com/apache/spark/pull/6205] > Akka timeout exceptions indicate which conf controls them > - > > Key: SPARK-6980 > URL: https://issues.apache.org/jira/browse/SPARK-6980 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Imran Rashid >Assignee: Harsh Gupta >Priority: Minor > Labels: starter > Fix For: 1.6.0 > > Attachments: Spark-6980-Test.scala > > > If you hit one of the akka timeouts, you just get an exception like > {code} > java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] > {code} > The exception doesn't indicate how to change the timeout, though there is > usually (always?) a corresponding setting in {{SparkConf}}. It would be > nice if the exception included the relevant setting. > I think this should be pretty easy to do -- we just need to create something > like a {{NamedTimeout}}. It would have its own {{await}} method that catches the > akka timeout and throws its own exception. We should change > {{RpcUtils.askTimeout}} and {{RpcUtils.lookupTimeout}} to always give a > {{NamedTimeout}}, so we can be sure that anytime we have a timeout, we get a > better exception. > Given the latest refactoring to the rpc layer, this needs to be done in both > {{AkkaUtils}} and {{AkkaRpcEndpoint}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
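An illustrative sketch of the NamedTimeout idea from the description; the class shape and names are assumptions, not the merged implementation:
{code}
// Sketch only: a timeout that knows which configuration key controls it, so the eventual
// TimeoutException tells the user which setting to tune (e.g. spark.rpc.askTimeout).
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.FiniteDuration

case class NamedTimeout(duration: FiniteDuration, confKey: String) {
  def awaitResult[T](future: Future[T]): T =
    try {
      Await.result(future, duration)
    } catch {
      case _: TimeoutException =>
        throw new TimeoutException(
          s"Futures timed out after [$duration]. This timeout is controlled by $confKey")
    }
}
{code}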
[jira] [Assigned] (SPARK-8069) Add support for cutoff to RandomForestClassifier
[ https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8069: --- Assignee: (was: Apache Spark) > Add support for cutoff to RandomForestClassifier > > > Key: SPARK-8069 > URL: https://issues.apache.org/jira/browse/SPARK-8069 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: holdenk >Priority: Minor > Original Estimate: 240h > Remaining Estimate: 240h > > Consider adding support for cutoffs similar to > http://cran.r-project.org/web/packages/randomForest/randomForest.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8069) Add support for cutoff to RandomForestClassifier
[ https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612771#comment-14612771 ] Apache Spark commented on SPARK-8069: - User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/7205 > Add support for cutoff to RandomForestClassifier > > > Key: SPARK-8069 > URL: https://issues.apache.org/jira/browse/SPARK-8069 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: holdenk >Priority: Minor > Original Estimate: 240h > Remaining Estimate: 240h > > Consider adding support for cutoffs similar to > http://cran.r-project.org/web/packages/randomForest/randomForest.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8069) Add support for cutoff to RandomForestClassifier
[ https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8069: --- Assignee: Apache Spark > Add support for cutoff to RandomForestClassifier > > > Key: SPARK-8069 > URL: https://issues.apache.org/jira/browse/SPARK-8069 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > Original Estimate: 240h > Remaining Estimate: 240h > > Consider adding support for cutoffs similar to > http://cran.r-project.org/web/packages/randomForest/randomForest.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8549) Fix the line length of SparkR
[ https://issues.apache.org/jira/browse/SPARK-8549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8549: --- Assignee: Apache Spark > Fix the line length of SparkR > - > > Key: SPARK-8549 > URL: https://issues.apache.org/jira/browse/SPARK-8549 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yu Ishikawa >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8549) Fix the line length of SparkR
[ https://issues.apache.org/jira/browse/SPARK-8549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8549: --- Assignee: (was: Apache Spark) > Fix the line length of SparkR > - > > Key: SPARK-8549 > URL: https://issues.apache.org/jira/browse/SPARK-8549 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yu Ishikawa > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8549) Fix the line length of SparkR
[ https://issues.apache.org/jira/browse/SPARK-8549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612763#comment-14612763 ] Apache Spark commented on SPARK-8549: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/7204 > Fix the line length of SparkR > - > > Key: SPARK-8549 > URL: https://issues.apache.org/jira/browse/SPARK-8549 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Yu Ishikawa > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8497) Graph Clique(Complete Connected Sub-graph) Discovery Algorithm
[ https://issues.apache.org/jira/browse/SPARK-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8497: - Assignee: Fan Jiang > Graph Clique(Complete Connected Sub-graph) Discovery Algorithm > -- > > Key: SPARK-8497 > URL: https://issues.apache.org/jira/browse/SPARK-8497 > Project: Spark > Issue Type: New Feature > Components: GraphX, ML, MLlib, Spark Core >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > Original Estimate: 72h > Remaining Estimate: 72h > > In recent years, the social network industry has had a high demand for complete > connected sub-graph discovery, and so has telecom. Similar to the graph of > connections on Twitter, the calls and other activities in the telecom world > form a huge social graph, and due to the nature of the communication medium it > shows the strongest inter-person relationships; graph-based analysis will > reveal tremendous value from telecom connections. > We need an algorithm in Spark to find ALL the strongest completely > connected sub-graphs (so-called cliques here) for EVERY person in the network, > which will be one of the starting points for understanding users' social > behaviour. > At Huawei, we have many real-world use cases that involve telecom social > graphs with tens of billions of edges and hundreds of millions of vertices, and > the number of cliques will also be in the tens of millions. The graph changes > fast, which means we need to analyse the graph pattern very often (one result > per day/week over a moving time window that spans multiple months). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6487: - Shepherd: Xiangrui Meng Target Version/s: 1.5.0 > Add sequential pattern mining algorithm to Spark MLlib > -- > > Key: SPARK-6487 > URL: https://issues.apache.org/jira/browse/SPARK-6487 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Zhang JiaJin > > [~mengxr] [~zhangyouhua] > Sequential pattern mining is an important branch of pattern mining. In > past work, we used sequence mining (mainly the PrefixSpan algorithm) to find > telecommunication signaling sequence patterns and achieved good results. But > once the data gets too large, the running time becomes too long and can no > longer meet the service requirements. We are ready to implement the PrefixSpan > algorithm in Spark and apply it to our subsequent work. > The related papers: > PrefixSpan: > Pei, Jian, et al. "Mining sequential patterns by pattern-growth: The > prefixspan approach." Knowledge and Data Engineering, IEEE Transactions on > 16.11 (2004): 1424-1440. > Parallel Algorithm: > Cong, Shengnan, Jiawei Han, and David Padua. "Parallel mining of closed > sequential patterns." Proceedings of the eleventh ACM SIGKDD international > conference on Knowledge discovery in data mining. ACM, 2005. > Distributed Algorithm: > Wei, Yong-qing, Dong Liu, and Lin-shan Duan. "Distributed PrefixSpan > algorithm based on MapReduce." Information Technology in Medicine and > Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012. > Pattern mining and sequential mining Knowledge: > Han, Jiawei, et al. "Frequent pattern mining: current status and future > directions." Data Mining and Knowledge Discovery 15.1 (2007): 55-86. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8572) Type coercion for ScalaUDFs
[ https://issues.apache.org/jira/browse/SPARK-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8572: --- Assignee: Apache Spark > Type coercion for ScalaUDFs > --- > > Key: SPARK-8572 > URL: https://issues.apache.org/jira/browse/SPARK-8572 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Critical > > Seems we do not do type coercion for ScalaUDFs. The following code will hit a > runtime exception. > {code} > import org.apache.spark.sql.functions._ > val myUDF = udf((x: Int) => x + 1) > val df = sqlContext.range(1, 10).toDF("i").select(myUDF($"i")) > df.explain(true) > df.show > {code} > It is also good to check if we do type coercion for PythonUDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8572) Type coercion for ScalaUDFs
[ https://issues.apache.org/jira/browse/SPARK-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612742#comment-14612742 ] Apache Spark commented on SPARK-8572: - User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/7203 > Type coercion for ScalaUDFs > --- > > Key: SPARK-8572 > URL: https://issues.apache.org/jira/browse/SPARK-8572 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Seems we do not do type coercion for ScalaUDFs. The following code will hit a > runtime exception. > {code} > import org.apache.spark.sql.functions._ > val myUDF = udf((x: Int) => x + 1) > val df = sqlContext.range(1, 10).toDF("i").select(myUDF($"i")) > df.explain(true) > df.show > {code} > It is also good to check if we do type coercion for PythonUDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8572) Type coercion for ScalaUDFs
[ https://issues.apache.org/jira/browse/SPARK-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8572: --- Assignee: (was: Apache Spark) > Type coercion for ScalaUDFs > --- > > Key: SPARK-8572 > URL: https://issues.apache.org/jira/browse/SPARK-8572 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Seems we do not do type coercion for ScalaUDFs. The following code will hit a > runtime exception. > {code} > import org.apache.spark.sql.functions._ > val myUDF = udf((x: Int) => x + 1) > val df = sqlContext.range(1, 10).toDF("i").select(myUDF($"i")) > df.explain(true) > df.show > {code} > It is also good to check if we do type coercion for PythonUDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
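For illustration, a hedged workaround sketch while coercion is missing: cast the bigint column produced by range to the UDF's expected int type explicitly.
{code}
// Sketch only: an explicit cast works around the missing Long -> Int coercion for the ScalaUDF.
import org.apache.spark.sql.functions._
import sqlContext.implicits._

val myUDF = udf((x: Int) => x + 1)
val df = sqlContext.range(1, 10).toDF("i").select(myUDF($"i".cast("int")))
df.show()
{code}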
[jira] [Updated] (SPARK-8806) run-tests scala style must fail if it does not adhere to Spark Code Style Guide
[ https://issues.apache.org/jira/browse/SPARK-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rekha Joshi updated SPARK-8806: --- Description: ./dev/run-tests Scala Style must fail if it does not adhere to Spark Code Style Guide Spark Scala Style :https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide The scala style test passes even if it does not adhere to the style guide. Now the scala style pass check gives only a false illusion of correctness. Alternatively, could we have a spark-format.xml for IDEs (IntelliJ/Eclipse), similar to hadoop-format.xml, to avoid style issues? was: ./dev/run-tests Scala Style must fail if it does not adhere to Spark Code Style Guide Spark Scala Style :https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide The scala style test passes even if it does not adhere to the style guide.Now the scala style pass check gives only a false illusion of correctness. Alternatively, could we have a spark-format.xml for IDEs (IntelliJ/Eclipse), similar to hadoop-format.xml, to avoid style issues? > run-tests scala style must fail if it does not adhere to Spark Code Style > Guide > --- > > Key: SPARK-8806 > URL: https://issues.apache.org/jira/browse/SPARK-8806 > Project: Spark > Issue Type: Wish > Components: Build >Affects Versions: 1.5.0 >Reporter: Rekha Joshi > > ./dev/run-tests Scala Style must fail if it does not adhere to Spark Code > Style Guide > Spark Scala Style > :https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide > The scala style test passes even if it does not adhere to the style guide. > Now the scala style pass check gives only a false illusion of correctness. > Alternatively, could we have a spark-format.xml for IDEs (IntelliJ/Eclipse), > similar to hadoop-format.xml, to avoid style issues? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8807) Add between operator in SparkR
Yu Ishikawa created SPARK-8807: -- Summary: Add between operator in SparkR Key: SPARK-8807 URL: https://issues.apache.org/jira/browse/SPARK-8807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Yu Ishikawa Add between operator in SparkR ``` df$age between c(1, 2) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
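For reference, a sketch of the existing Scala DataFrame API that the proposed SparkR operator would presumably mirror; Column.between has been available in the Scala API since Spark 1.4, and the sample data here is illustrative only:

{code}
import sqlContext.implicits._   // assumes an existing sqlContext

val df = Seq(("alice", 1), ("bob", 2), ("carol", 3)).toDF("name", "age")

// Scala equivalent of the proposed `df$age between c(1, 2)`; bounds are inclusive.
df.filter($"age".between(1, 2)).show()   // keeps alice and bob
{code}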
[jira] [Commented] (SPARK-8159) Improve SQL/DataFrame expression coverage
[ https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612728#comment-14612728 ] Cheng Hao commented on SPARK-8159: -- Would it be possible to add support for all of the expressions in a SINGLE PR for the Python API, and another SINGLE PR for R, after we finish all of the expressions? At least we can save Jenkins resources compared to adding them one by one. > Improve SQL/DataFrame expression coverage > - > > Key: SPARK-8159 > URL: https://issues.apache.org/jira/browse/SPARK-8159 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > This is an umbrella ticket to track new expressions we are adding to > SQL/DataFrame. > For each new expression, we should: > 1. Add a new Expression implementation in > org.apache.spark.sql.catalyst.expressions > 2. If applicable, implement the code generated version (by implementing > genCode). > 3. Add comprehensive unit tests (for all the data types the expressions > support). > 4. If applicable, add a new function for DataFrame in > org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for > Python. > For date/time functions, put them in expressions/datetime.scala, and create a > DateTimeFunctionSuite.scala for testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8806) run-tests scala style must fail if it does not adhere to Spark Code Style Guide
[ https://issues.apache.org/jira/browse/SPARK-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612727#comment-14612727 ] Rekha Joshi commented on SPARK-8806: Looking into what is possible in the run-tests script. Thanks. > run-tests scala style must fail if it does not adhere to Spark Code Style > Guide > --- > > Key: SPARK-8806 > URL: https://issues.apache.org/jira/browse/SPARK-8806 > Project: Spark > Issue Type: Wish > Components: Build >Affects Versions: 1.5.0 >Reporter: Rekha Joshi > > ./dev/run-tests Scala Style must fail if it does not adhere to Spark Code > Style Guide > Spark Scala Style > :https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide > The scala style test passes even if it does not adhere to the style guide. Now > the scala style pass check gives only a false illusion of correctness. > Alternatively, could we have a spark-format.xml for IDEs (IntelliJ/Eclipse), > similar to hadoop-format.xml, to avoid style issues? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8806) run-tests scala style must fail if it does not adhere to Spark Code Style Guide
Rekha Joshi created SPARK-8806: -- Summary: run-tests scala style must fail if it does not adhere to Spark Code Style Guide Key: SPARK-8806 URL: https://issues.apache.org/jira/browse/SPARK-8806 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.5.0 Reporter: Rekha Joshi ./dev/run-tests Scala Style must fail if it does not adhere to Spark Code Style Guide Spark Scala Style :https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide The scala style test passes even if it does not adhere to the style guide. Now the scala style pass check gives only a false illusion of correctness. Alternatively, could we have a spark-format.xml for IDEs (IntelliJ/Eclipse), similar to hadoop-format.xml, to avoid style issues? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[ https://issues.apache.org/jira/browse/SPARK-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8782. Resolution: Fixed Fix Version/s: 1.5.0 > GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes) > > > Key: SPARK-8782 > URL: https://issues.apache.org/jira/browse/SPARK-8782 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Fix For: 1.5.0 > > > Queries containing ORDER BY NULL currently result in a code generation > exception: > {code} > public SpecificOrdering > generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { > return new SpecificOrdering(expr); > } > class SpecificOrdering extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { > private org.apache.spark.sql.catalyst.expressions.Expression[] > expressions = null; > public > SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) > { > expressions = expr; > } > @Override > public int compare(InternalRow a, InternalRow b) { > InternalRow i = null; // Holds current row being evaluated. > > i = a; > final Object primitive1 = null; > i = b; > final Object primitive3 = null; > if (true && true) { > // Nothing > } else if (true) { > return -1; > } else if (true) { > return 1; > } else { > int comp = primitive1.compare(primitive3); > if (comp != 0) { > return comp; > } > } > > return 0; > } > } > org.codehaus.commons.compiler.CompileException: Line 29, Column 43: A method > named "compare" is not declared in any enclosing class nor any supertype, nor > through a static import > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
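A hedged, hypothetical minimal reproduction of the NullType ordering path fixed above: ordering by a NULL literal is presumably what forces GenerateOrdering to handle NullType, both in SQL and through the DataFrame API. The table name and data are illustrative only:

{code}
import org.apache.spark.sql.functions._
import sqlContext.implicits._   // assumes an existing sqlContext

val df = Seq(1, 2, 3).toDF("x")
df.registerTempTable("t")

sqlContext.sql("SELECT x FROM t ORDER BY NULL").show()  // SQL form from the ticket title
df.orderBy(lit(null)).show()                            // DataFrame form of the same ordering
{code}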
[jira] [Commented] (SPARK-8159) Improve SQL/DataFrame expression coverage
[ https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612725#comment-14612725 ] Yu Ishikawa commented on SPARK-8159: [~rxin] How should we deal with the {{pyspark}} and {{SparkR}} versions? - Make another umbrella issue each for {{pyspark}} and {{SparkR}} - Make sub-issues under each of those issues - Reopen each issue for {{pyspark}} and {{SparkR}} And which ones should we support in {{pyspark}} and {{SparkR}}? All? > Improve SQL/DataFrame expression coverage > - > > Key: SPARK-8159 > URL: https://issues.apache.org/jira/browse/SPARK-8159 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > This is an umbrella ticket to track new expressions we are adding to > SQL/DataFrame. > For each new expression, we should: > 1. Add a new Expression implementation in > org.apache.spark.sql.catalyst.expressions > 2. If applicable, implement the code generated version (by implementing > genCode). > 3. Add comprehensive unit tests (for all the data types the expressions > support). > 4. If applicable, add a new function for DataFrame in > org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for > Python. > For date/time functions, put them in expressions/datetime.scala, and create a > DateTimeFunctionSuite.scala for testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8768) SparkSubmitSuite fails on Hadoop 1.x builds due to java.lang.VerifyError in Akka Protobuf
[ https://issues.apache.org/jira/browse/SPARK-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612724#comment-14612724 ] Josh Rosen commented on SPARK-8768: --- [~zsxwing], I don't think so: the master Maven build uses build/mvn, which, as far as I know, should now be downloading the newer Maven version that is supposed to work. > SparkSubmitSuite fails on Hadoop 1.x builds due to java.lang.VerifyError in > Akka Protobuf > - > > Key: SPARK-8768 > URL: https://issues.apache.org/jira/browse/SPARK-8768 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.0 >Reporter: Josh Rosen >Priority: Blocker > > The end-to-end SparkSubmitSuite tests ("launch simple application with > spark-submit", "include jars passed in through --jars", and "include jars > passed in through --packages") are currently failing for the pre-YARN Hadoop > builds. > I managed to reproduce one of the Jenkins failures locally: > {code} > build/mvn -Phadoop-1 -Dhadoop.version=1.2.1 -Phive -Phive-thriftserver > -Pkinesis-asl test -DwildcardSuites=org.apache.spark.deploy.SparkSubmitSuite > -Dtest=none > {code} > Here's the output from unit-tests.log: > {code} > = TEST OUTPUT FOR o.a.s.deploy.SparkSubmitSuite: 'launch simple > application with spark-submit' = > 15/07/01 13:39:58.964 redirect stderr for command ./bin/spark-submit INFO > Utils: SLF4J: Class path contains multiple SLF4J bindings. > 15/07/01 13:39:58.964 redirect stderr for command ./bin/spark-submit INFO > Utils: SLF4J: Found binding in > [jar:file:/Users/joshrosen/Documents/spark-2/assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] > 15/07/01 13:39:58.965 redirect stderr for command ./bin/spark-submit INFO > Utils: SLF4J: Found binding in > [jar:file:/Users/joshrosen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] > 15/07/01 13:39:58.965 redirect stderr for command ./bin/spark-submit INFO > Utils: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. 
> 15/07/01 13:39:58.965 redirect stderr for command ./bin/spark-submit INFO > Utils: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 15/07/01 13:39:58.966 redirect stderr for command ./bin/spark-submit INFO > Utils: 15/07/01 13:39:58 INFO SparkContext: Running Spark version > 1.5.0-SNAPSHOT > 15/07/01 13:39:59.334 redirect stderr for command ./bin/spark-submit INFO > Utils: 15/07/01 13:39:59 INFO SecurityManager: Changing view acls to: > joshrosen > 15/07/01 13:39:59.335 redirect stderr for command ./bin/spark-submit INFO > Utils: 15/07/01 13:39:59 INFO SecurityManager: Changing modify acls to: > joshrosen > 15/07/01 13:39:59.335 redirect stderr for command ./bin/spark-submit INFO > Utils: 15/07/01 13:39:59 INFO SecurityManager: SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(joshrosen); users with modify permissions: Set(joshrosen) > 15/07/01 13:39:59.898 redirect stderr for command ./bin/spark-submit INFO > Utils: 15/07/01 13:39:59 INFO Slf4jLogger: Slf4jLogger started > 15/07/01 13:39:59.934 redirect stderr for command ./bin/spark-submit INFO > Utils: 15/07/01 13:39:59 INFO Remoting: Starting remoting > 15/07/01 13:40:00.009 redirect stderr for command ./bin/spark-submit INFO > Utils: 15/07/01 13:40:00 ERROR ActorSystemImpl: Uncaught fatal error from > thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down > ActorSystem [sparkDriver] > 15/07/01 13:40:00.009 redirect stderr for command ./bin/spark-submit INFO > Utils: java.lang.VerifyError: class > akka.remote.WireFormats$AkkaControlMessage overrides final method > getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet; > 15/07/01 13:40:00.009 redirect stderr for command ./bin/spark-submit INFO > Utils:at java.lang.ClassLoader.defineClass1(Native Method) > 15/07/01 13:40:00.009 redirect stderr for command ./bin/spark-submit INFO > Utils:at java.lang.ClassLoader.defineClass(ClassLoader.java:800) > 15/07/01 13:40:00.009 redirect stderr for command ./bin/spark-submit INFO > Utils:at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > 15/07/01 13:40:00.010 redirect stderr for command ./bin/spark-submit INFO > Utils:at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > 15/07/01 13:40:00.010 redirect stderr for command ./bin/spark-submit INFO > Utils:at java.net.URLClassLoader.access$100(URLClassLoader.java:71) > 15/07/01 13:40:00.010 redirect stderr for command ./bin/spark-submit INFO > Utils:at java.net.URLClassLoader$1.run(URLClassLoader.java
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612719#comment-14612719 ] liyunzhang_intel commented on SPARK-5682: - [~hujiayin]: {quote} the AES solution is a bit heavy to encode/decode the live steaming data. {quote} Is there any other solution to encode/decode the live streaming data? please share your suggestion with us. > Add encrypted shuffle in spark > -- > > Key: SPARK-5682 > URL: https://issues.apache.org/jira/browse/SPARK-5682 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Reporter: liyunzhang_intel > Attachments: Design Document of Encrypted Spark > Shuffle_20150209.docx, Design Document of Encrypted Spark > Shuffle_20150318.docx, Design Document of Encrypted Spark > Shuffle_20150402.docx, Design Document of Encrypted Spark > Shuffle_20150506.docx > > > Encrypted shuffle is enabled in hadoop 2.6 which make the process of shuffle > data safer. This feature is necessary in spark. AES is a specification for > the encryption of electronic data. There are 5 common modes in AES. CTR is > one of the modes. We use two codec JceAesCtrCryptoCodec and > OpensslAesCtrCryptoCodec to enable spark encrypted shuffle which is also used > in hadoop encrypted shuffle. JceAesCtrypoCodec uses encrypted algorithms jdk > provides while OpensslAesCtrCryptoCodec uses encrypted algorithms openssl > provides. > Because ugi credential info is used in the process of encrypted shuffle, we > first enable encrypted shuffle on spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
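A minimal JCE-only sketch of AES/CTR stream encryption, the primitive that a JceAesCtrCryptoCodec-style codec wraps. This is not the proposed Spark shuffle code from SPARK-5682; the hard-coded key and IV are illustrative only (a real deployment would use random IVs and key material distributed via UGI credentials, as the design documents describe):

{code}
import java.io.ByteArrayOutputStream
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

val key = new SecretKeySpec(Array.fill[Byte](16)(1.toByte), "AES")  // 128-bit demo key
val iv  = new IvParameterSpec(Array.fill[Byte](16)(2.toByte))       // demo counter block

val cipher = Cipher.getInstance("AES/CTR/NoPadding")
cipher.init(Cipher.ENCRYPT_MODE, key, iv)

val sink = new ByteArrayOutputStream()
val out  = new CipherOutputStream(sink, cipher)
out.write("shuffle block bytes".getBytes("UTF-8"))
out.close()
val encrypted = sink.toByteArray  // same length as the input: CTR is a stream mode
{code}

Because CTR turns AES into a keystream, it adds no padding and can wrap the existing shuffle output streams, which is presumably why both codecs mentioned above standardize on it.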
[jira] [Commented] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet
[ https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612686#comment-14612686 ] Matt Massie commented on SPARK-7263: I just pushed a large update to my github account. I'll have a PR submitted to the Spark project very soon. https://github.com/apache/spark/compare/master...massie:parquet-shuffle > Add new shuffle manager which stores shuffle blocks in Parquet > -- > > Key: SPARK-7263 > URL: https://issues.apache.org/jira/browse/SPARK-7263 > Project: Spark > Issue Type: New Feature > Components: Block Manager >Reporter: Matt Massie > > I have a working prototype of this feature that can be viewed at > https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1 > Setting "spark.shuffle.manager" to "parquet" enables this shuffle manager. > The dictionary support that Parquet provides appreciably reduces the amount of > memory that objects use; however, once Parquet data is shuffled, all the > dictionary information is lost and the column-oriented data is written to > shuffle > blocks in a record-oriented fashion. This shuffle manager addresses this issue > by reading and writing all shuffle blocks in the Parquet format. > If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a > Parquet > schema and used directly; otherwise, the Parquet schema is generated via > reflection. > Currently, the only non-Avro keys supported are primitive types. The reflection > code can be improved (or replaced) to support complex records. > The ParquetShufflePair class allows the shuffle key and value to be stored in > Parquet blocks as a single record with a single schema. > This commit adds the following new Spark configuration options: > "spark.shuffle.parquet.compression" - sets the Parquet compression codec > "spark.shuffle.parquet.blocksize" - sets the Parquet block size > "spark.shuffle.parquet.pagesize" - sets the Parquet page size > "spark.shuffle.parquet.enabledictionary" - turns dictionary encoding on/off > Parquet does not (and has no plans to) support a streaming API. Metadata > sections > are scattered through a Parquet file, making a streaming API difficult. As > such, > the ShuffleBlockFetcherIterator has been modified to fetch the entire contents > of map outputs into temporary blocks before loading the data into the reducer. > Interesting future asides: > o There is no need to define a data serializer (although Spark requires it) > o Parquet supports predicate pushdown and projection, which could be used > between shuffle stages to improve performance in the future -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
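A sketch of what enabling the prototype might look like, using only the configuration keys listed in the SPARK-7263 description above; the keys exist only in the linked prototype branch, not in released Spark, and the values chosen here are illustrative:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parquet-shuffle-demo")
  .set("spark.shuffle.manager", "parquet")                        // enables the prototype manager
  .set("spark.shuffle.parquet.compression", "snappy")             // illustrative codec choice
  .set("spark.shuffle.parquet.blocksize", (64 * 1024 * 1024).toString)
  .set("spark.shuffle.parquet.pagesize", (1024 * 1024).toString)
  .set("spark.shuffle.parquet.enabledictionary", "true")

val sc = new SparkContext(conf)
{code}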
[jira] [Commented] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
[ https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612676#comment-14612676 ] Matt Cheah commented on SPARK-5581: --- I'd be interested in taking something like this on =). [~joshrosen] it sounds like there are still some open questions though; can I write up a PR taking your comments into consideration and we can iterate from there? > When writing sorted map output file, avoid open / close between each partition > -- > > Key: SPARK-5581 > URL: https://issues.apache.org/jira/browse/SPARK-5581 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.3.0 >Reporter: Sandy Ryza > > {code} > // Bypassing merge-sort; get an iterator by partition and just write > everything directly. > for ((id, elements) <- this.partitionedIterator) { > if (elements.hasNext) { > val writer = blockManager.getDiskWriter( > blockId, outputFile, ser, fileBufferSize, > context.taskMetrics.shuffleWriteMetrics.get) > for (elem <- elements) { > writer.write(elem) > } > writer.commitAndClose() > val segment = writer.fileSegment() > lengths(id) = segment.length > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-8784) Add python API for hex/unhex
[ https://issues.apache.org/jira/browse/SPARK-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-8784: Reopened due to build breaking. > Add python API for hex/unhex > > > Key: SPARK-8784 > URL: https://issues.apache.org/jira/browse/SPARK-8784 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception
[ https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612673#comment-14612673 ] S commented on SPARK-7483: -- This does NOT work. Registering the classes stops it from crashing, but produces a bug in the FP-Growth algorithm. Specifically, the frequency counts for itemsets are wrong. :( > [MLLib] Using Kryo with FPGrowth fails with an exception > > > Key: SPARK-7483 > URL: https://issues.apache.org/jira/browse/SPARK-7483 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1 >Reporter: Tomasz Bartczak >Priority: Minor > > When using FPGrowth algorithm with KryoSerializer - Spark fails with > {code} > Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): > com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: > Can not set final scala.collection.mutable.ListBuffer field > org.apache.spark.mllib.fpm.FPTree$Summary.nodes to > scala.collection.mutable.ArrayBuffer > Serialization trace: > nodes (org.apache.spark.mllib.fpm.FPTree$Summary) > org$apache$spark$mllib$fpm$FPTree$$summaries > (org.apache.spark.mllib.fpm.FPTree) > {code} > This can be easily reproduced in spark codebase by setting > {code} > conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") > {code} and running FPGrowthSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
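The registration workaround mentioned in the comment above might look like the following sketch. The class names come from the serialization trace in the ticket (other inner classes may also need registering), Class.forName is used because they are not public, and, as the comment stresses, this reportedly only avoids the crash while still producing wrong itemset frequencies, so it is not a real fix:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    // Names taken from the KryoException serialization trace above.
    Class.forName("org.apache.spark.mllib.fpm.FPTree"),
    Class.forName("org.apache.spark.mllib.fpm.FPTree$Summary")
  ))
{code}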
[jira] [Assigned] (SPARK-8801) Support TypeCollection in ExpectsInputTypes
[ https://issues.apache.org/jira/browse/SPARK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8801: --- Assignee: Reynold Xin (was: Apache Spark) > Support TypeCollection in ExpectsInputTypes > --- > > Key: SPARK-8801 > URL: https://issues.apache.org/jira/browse/SPARK-8801 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Some functions support more than one input types for each parameter. For > example, length supports binary and string, and maybe array/struct in the > future. > This ticket proposes a TypeCollection AbstractDataType that supports multiple > data types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8801) Support TypeCollection in ExpectsInputTypes
[ https://issues.apache.org/jira/browse/SPARK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8801: --- Assignee: Apache Spark (was: Reynold Xin) > Support TypeCollection in ExpectsInputTypes > --- > > Key: SPARK-8801 > URL: https://issues.apache.org/jira/browse/SPARK-8801 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Some functions support more than one input types for each parameter. For > example, length supports binary and string, and maybe array/struct in the > future. > This ticket proposes a TypeCollection AbstractDataType that supports multiple > data types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8801) Support TypeCollection in ExpectsInputTypes
[ https://issues.apache.org/jira/browse/SPARK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612671#comment-14612671 ] Apache Spark commented on SPARK-8801: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7202 > Support TypeCollection in ExpectsInputTypes > --- > > Key: SPARK-8801 > URL: https://issues.apache.org/jira/browse/SPARK-8801 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Some functions support more than one input types for each parameter. For > example, length supports binary and string, and maybe array/struct in the > future. > This ticket proposes a TypeCollection AbstractDataType that supports multiple > data types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
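A self-contained toy sketch of the idea behind SPARK-8801, deliberately not Spark's internal AbstractDataType/ExpectsInputTypes API (whose exact shape is decided in the linked PR): an expected-input-type that is satisfied by any member of a collection of types. All names here are invented for illustration:

{code}
// Toy model, independent of Spark's actual catalyst classes.
sealed trait ToyAbstractType { def acceptsType(t: ToyDataType): Boolean }

sealed trait ToyDataType extends ToyAbstractType {
  def acceptsType(t: ToyDataType): Boolean = t == this  // a concrete type accepts only itself
}
case object StringTypeT extends ToyDataType
case object BinaryTypeT extends ToyDataType
case object IntTypeT extends ToyDataType

// The proposal: a collection that accepts any of its member types.
case class ToyTypeCollection(types: Seq[ToyAbstractType]) extends ToyAbstractType {
  def acceptsType(t: ToyDataType): Boolean = types.exists(_.acceptsType(t))
}

// e.g. length(...) would declare TypeCollection(StringType, BinaryType):
val lengthInput = ToyTypeCollection(Seq(StringTypeT, BinaryTypeT))
assert(lengthInput.acceptsType(BinaryTypeT))
assert(!lengthInput.acceptsType(IntTypeT))
{code}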
[jira] [Assigned] (SPARK-8803) Crosstab element's can't contain null's and back ticks
[ https://issues.apache.org/jira/browse/SPARK-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8803: --- Assignee: (was: Apache Spark) > Crosstab element's can't contain null's and back ticks > -- > > Key: SPARK-8803 > URL: https://issues.apache.org/jira/browse/SPARK-8803 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Burak Yavuz > > Having back ticks or null as elements causes problems. > Since elements become column names, we have to drop them from the element as > back ticks are special characters. > Having null throws exceptions, we could replace them with empty strings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8803) Crosstab element's can't contain null's and back ticks
[ https://issues.apache.org/jira/browse/SPARK-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8803: --- Assignee: Apache Spark > Crosstab element's can't contain null's and back ticks > -- > > Key: SPARK-8803 > URL: https://issues.apache.org/jira/browse/SPARK-8803 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Burak Yavuz >Assignee: Apache Spark > > Having back ticks or null as elements causes problems. > Since elements become column names, we have to drop them from the element as > back ticks are special characters. > Having null throws exceptions, we could replace them with empty strings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8803) Crosstab element's can't contain null's and back ticks
[ https://issues.apache.org/jira/browse/SPARK-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612662#comment-14612662 ] Apache Spark commented on SPARK-8803: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/7201 > Crosstab element's can't contain null's and back ticks > -- > > Key: SPARK-8803 > URL: https://issues.apache.org/jira/browse/SPARK-8803 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Burak Yavuz > > Having back ticks or null as elements causes problems. > Since elements become column names, we have to drop them from the element as > back ticks are special characters. > Having null throws exceptions, we could replace them with empty strings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
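A hedged reproduction sketch of the problem described in SPARK-8803 above: crosstab turns the distinct values of the second column into result column names, so back ticks and nulls in the data cause trouble. The sample data is illustrative only:

{code}
import sqlContext.implicits._   // assumes an existing sqlContext

val df = Seq(
  ("a`b", "x"),                      // value containing a back tick
  (null.asInstanceOf[String], "y"),  // null element
  ("c", "x")
).toDF("key", "value")

// DataFrameStatFunctions.crosstab (Spark 1.4+); with the data above, this is
// the call that misbehaves per this ticket.
df.stat.crosstab("key", "value").show()
{code}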
[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612658#comment-14612658 ] Joseph K. Bradley commented on SPARK-7768: -- Making VectorUDT public blocks on this issue, but it should probably be in a separate PR. > Make user-defined type (UDT) API public > --- > > Key: SPARK-7768 > URL: https://issues.apache.org/jira/browse/SPARK-7768 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Priority: Critical > > As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it > would be nice to make the UDT API public in 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8733) ML RDD.unpersist calls should use blocking = false
[ https://issues.apache.org/jira/browse/SPARK-8733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-8733. Resolution: Won't Fix Per discussion on the linked PR, closing this for now. > ML RDD.unpersist calls should use blocking = false > -- > > Key: SPARK-8733 > URL: https://issues.apache.org/jira/browse/SPARK-8733 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley > Attachments: Screen Shot 2015-06-30 at 10.51.44 AM.png > > Original Estimate: 72h > Remaining Estimate: 72h > > MLlib uses unpersist in many places, but is not consistent about blocking vs > not. We should check through all of MLlib and change calls to use blocking = > false, unless there is a real need to block. I have run into issues with > futures timing out because of unpersist() calls, when there was no real need > for the ML method to fail. > See attached screenshot. Training succeeded, but the final unpersist during > cleanup failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
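For context on the API difference discussed in SPARK-8733 above (the change itself was closed as Won't Fix), a short sketch of the non-blocking form: it returns immediately instead of waiting for every executor to confirm the cached blocks were removed.

{code}
val data = sc.parallelize(1 to 1000).cache()   // assumes an existing SparkContext `sc`
data.count()                                   // materialize the cache
data.unpersist(blocking = false)               // vs. unpersist(), which blocks by default
{code}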
[jira] [Resolved] (SPARK-7104) Support model save/load in Python's Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7104. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6821 [https://github.com/apache/spark/pull/6821] > Support model save/load in Python's Word2Vec > > > Key: SPARK-7104 > URL: https://issues.apache.org/jira/browse/SPARK-7104 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa >Priority: Minor > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7104) Support model save/load in Python's Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7104: - Assignee: Yu Ishikawa > Support model save/load in Python's Word2Vec > > > Key: SPARK-7104 > URL: https://issues.apache.org/jira/browse/SPARK-7104 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa >Priority: Minor > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8805) Spark shell not working
[ https://issues.apache.org/jira/browse/SPARK-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Perinkulam I Ganesh updated SPARK-8805: --- Description: I am using Git Bash on Windows. Installed OpenJDK 1.8.0_45 and Spark 1.4.0. I am able to build Spark and install it. But whenever I execute the Spark shell it gives me the following error: $ spark-shell /c/.../spark/bin/spark-class: line 76: conditional binary operator expected was: I am using Git Bash. Installed OpenJDK 1.8.0_45 and Spark 1.4.0. I am able to build Spark and install it. But whenever I execute the Spark shell it gives me the following error: $ spark-shell /c/.../spark/bin/spark-class: line 76: conditional binary operator expected > Spark shell not working > --- > > Key: SPARK-8805 > URL: https://issues.apache.org/jira/browse/SPARK-8805 > Project: Spark > Issue Type: Brainstorming > Components: Spark Core, Windows >Reporter: Perinkulam I Ganesh > > I am using Git Bash on Windows. Installed OpenJDK 1.8.0_45 and Spark 1.4.0. > I am able to build Spark and install it. But whenever I execute the Spark shell > it gives me the following error: > $ spark-shell > /c/.../spark/bin/spark-class: line 76: conditional binary operator expected -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8805) Spark shell not working
[ https://issues.apache.org/jira/browse/SPARK-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Perinkulam I Ganesh updated SPARK-8805: --- Description: I am using Git Bash. Installed OpenJDK 1.8.0_45 and Spark 1.4.0. I am able to build Spark and install it. But whenever I execute the Spark shell it gives me the following error: $ spark-shell /c/.../spark/bin/spark-class: line 76: conditional binary operator expected was: I am using Git Bash. Installed OpenJDK 1.8.0_45 and Spark 1.4.0. I am able to build Spark and install it. But whenever I execute the Spark shell it gives me the following error: > Spark shell not working > --- > > Key: SPARK-8805 > URL: https://issues.apache.org/jira/browse/SPARK-8805 > Project: Spark > Issue Type: Brainstorming > Components: Spark Core, Windows >Reporter: Perinkulam I Ganesh > > I am using Git Bash. Installed OpenJDK 1.8.0_45 and Spark 1.4.0. > I am able to build Spark and install it. But whenever I execute the Spark shell > it gives me the following error: > $ spark-shell > /c/.../spark/bin/spark-class: line 76: conditional binary operator expected -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery
[ https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612646#comment-14612646 ] Cheng Lian commented on SPARK-8501: --- Exactly. Please see my PR description here https://github.com/apache/spark/pull/7199 > ORC data source may give empty schema if an ORC file containing zero rows is > picked for schema discovery > > > Key: SPARK-8501 > URL: https://issues.apache.org/jira/browse/SPARK-8501 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Hive 0.13.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > > Not sure whether this should be considered as a bug of ORC bundled with Hive > 0.13.1: for an ORC file containing zero rows, the schema written in its > footer contains zero fields (e.g. {{struct<>}}). > To reproduce this issue, let's first produce an empty ORC file. Copy data > file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to > {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the > following lines in Hive 0.13.1 CLI: > {noformat} > $ hive > hive> CREATE TABLE foo(key INT, value STRING); > hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo; > hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1; > {noformat} > Now inspect the empty ORC file we just wrote: > {noformat} > $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0 > Structure for /user/hive/warehouse_hive13/bar/00_0 > 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from > /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, > length: 9223372036854775807} > Rows: 0 > Compression: ZLIB > Compression size: 262144 > Type: struct<> > Stripe Statistics: > File Statistics: > Column 0: count: 0 > Stripes: > {noformat} > Notice the {{struct<>}} part. > This "feature" is OK for Hive, which has a central metastore to save table > schema. But for users who read raw data files without Hive metastore with > Spark SQL 1.4.0, it causes problem because currently the ORC data source just > picks a random part-file whichever comes the first for schema discovery. > Expected behavior can be: > # Try all files one by one until we find a part-file with non-empty schema. > # Throws {{AnalysisException}} if no such part-file can be found. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8805) Spark shell not working
[ https://issues.apache.org/jira/browse/SPARK-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Perinkulam I Ganesh updated SPARK-8805: --- Description: I am using Git Bash. Installed OpenJDK 1.8.0_45 and Spark 1.4.0. I am able to build Spark and install it. But whenever I execute the Spark shell it gives me the following error: > Spark shell not working > --- > > Key: SPARK-8805 > URL: https://issues.apache.org/jira/browse/SPARK-8805 > Project: Spark > Issue Type: Brainstorming > Components: Spark Core, Windows >Reporter: Perinkulam I Ganesh > > I am using Git Bash. Installed OpenJDK 1.8.0_45 and Spark 1.4.0. > I am able to build Spark and install it. But whenever I execute the Spark shell > it gives me the following error: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery
[ https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612643#comment-14612643 ] Zhan Zhang commented on SPARK-8501: --- Because in spark, we will not create the orc file if the record is empty. It is only happens with the ORC file created by hive, right? > ORC data source may give empty schema if an ORC file containing zero rows is > picked for schema discovery > > > Key: SPARK-8501 > URL: https://issues.apache.org/jira/browse/SPARK-8501 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Hive 0.13.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > > Not sure whether this should be considered as a bug of ORC bundled with Hive > 0.13.1: for an ORC file containing zero rows, the schema written in its > footer contains zero fields (e.g. {{struct<>}}). > To reproduce this issue, let's first produce an empty ORC file. Copy data > file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to > {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the > following lines in Hive 0.13.1 CLI: > {noformat} > $ hive > hive> CREATE TABLE foo(key INT, value STRING); > hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo; > hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1; > {noformat} > Now inspect the empty ORC file we just wrote: > {noformat} > $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0 > Structure for /user/hive/warehouse_hive13/bar/00_0 > 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from > /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, > length: 9223372036854775807} > Rows: 0 > Compression: ZLIB > Compression size: 262144 > Type: struct<> > Stripe Statistics: > File Statistics: > Column 0: count: 0 > Stripes: > {noformat} > Notice the {{struct<>}} part. > This "feature" is OK for Hive, which has a central metastore to save table > schema. But for users who read raw data files without Hive metastore with > Spark SQL 1.4.0, it causes problem because currently the ORC data source just > picks a random part-file whichever comes the first for schema discovery. > Expected behavior can be: > # Try all files one by one until we find a part-file with non-empty schema. > # Throws {{AnalysisException}} if no such part-file can be found. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8805) Spark shell not working
[ https://issues.apache.org/jira/browse/SPARK-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Perinkulam I Ganesh updated SPARK-8805: --- Summary: Spark shell not working (was: I installed Git Bash) > Spark shell not working > --- > > Key: SPARK-8805 > URL: https://issues.apache.org/jira/browse/SPARK-8805 > Project: Spark > Issue Type: Brainstorming > Components: Spark Core, Windows >Reporter: Perinkulam I Ganesh > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8805) I installed Git Bash
Perinkulam I Ganesh created SPARK-8805: -- Summary: I installed Git Bash Key: SPARK-8805 URL: https://issues.apache.org/jira/browse/SPARK-8805 Project: Spark Issue Type: Brainstorming Components: Spark Core, Windows Reporter: Perinkulam I Ganesh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8685) dataframe left joins are not working as expected in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612637#comment-14612637 ] Reynold Xin commented on SPARK-8685: The problem is that Python Row doesn't allow duplicate values, because under the hood it is stored as a dict. > dataframe left joins are not working as expected in pyspark > --- > > Key: SPARK-8685 > URL: https://issues.apache.org/jira/browse/SPARK-8685 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.4.0 > Environment: ubuntu 14.04 >Reporter: axel dahl >Assignee: Davies Liu > > I have the following code: > {code} > from pyspark import SQLContext > d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, > {'name':'alice', 'country': 'jpn', 'age': 2}, > {'name':'carol', 'country': 'ire', 'age': 3}] > d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, > {'name':'carol', 'country': 'ire', 'colour':'green'}] > r1 = sc.parallelize(d1) > r2 = sc.parallelize(d2) > sqlContext = SQLContext(sc) > df1 = sqlContext.createDataFrame(d1) > df2 = sqlContext.createDataFrame(d2) > df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), > 'left_outer').collect() > {code} > When I run it I get the following, (notice in the first row, all join keys > are take from the right-side and so are blanked out): > {code} > [Row(age=2, country=None, name=None, colour=None, country=None, name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > I would expect to get (though ideally without duplicate columns): > {code} > [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, > name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > The workaround for now is this rather clunky piece of code: > {code} > df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', > 'name2').withColumnRenamed('country', 'country2') > df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), > 'left_outer').collect() > {code} > Also, {{.show()}} works > {code} > sqlContext = SQLContext(sc) > df1 = sqlContext.createDataFrame(d1) > df2 = sqlContext.createDataFrame(d2) > df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), > 'left_outer').show() > +---+---+-+--+---+-+ > |age|country| name|colour|country| name| > +---+---+-+--+---+-+ > | 3|ire|carol| green|ire|carol| > | 2|jpn|alice| null| null| null| > | 1|usa| bob| red|usa| bob| > +---+---+-+--+---+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8784) Add python API for hex/unhex
[ https://issues.apache.org/jira/browse/SPARK-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8784. Resolution: Fixed Fix Version/s: 1.5.0 > Add python API for hex/unhex > > > Key: SPARK-8784 > URL: https://issues.apache.org/jira/browse/SPARK-8784 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8685) dataframe left joins are not working as expected in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8685: Description: I have the following code: {code} from pyspark import SQLContext d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice', 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}] d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'carol', 'country': 'ire', 'colour':'green'}] r1 = sc.parallelize(d1) r2 = sc.parallelize(d2) sqlContext = SQLContext(sc) df1 = sqlContext.createDataFrame(d1) df2 = sqlContext.createDataFrame(d2) df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 'left_outer').collect() {code} When I run it I get the following, (notice in the first row, all join keys are take from the right-side and so are blanked out): {code} [Row(age=2, country=None, name=None, colour=None, country=None, name=None), Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'), Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', name=u'alice')] {code} I would expect to get (though ideally without duplicate columns): {code} [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, name=None), Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'), Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', name=u'alice')] {code} The workaround for now is this rather clunky piece of code: {code} df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 'name2').withColumnRenamed('country', 'country2') df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 'left_outer').collect() {code} Also, {{.show()}} works {code} sqlContext = SQLContext(sc) df1 = sqlContext.createDataFrame(d1) df2 = sqlContext.createDataFrame(d2) df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 'left_outer').show() +---+---+-+--+---+-+ |age|country| name|colour|country| name| +---+---+-+--+---+-+ | 3|ire|carol| green|ire|carol| | 2|jpn|alice| null| null| null| | 1|usa| bob| red|usa| bob| +---+---+-+--+---+-+ {code} was: I have the following code: {code} from pyspark import SQLContext d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice', 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}] d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'carol', 'country': 'ire', 'colour':'green'}] r1 = sc.parallelize(d1) r2 = sc.parallelize(d2) sqlContext = SQLContext(sc) df1 = sqlContext.createDataFrame(d1) df2 = sqlContext.createDataFrame(d2) df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 'left_outer').collect() {code} When I run it I get the following, (notice in the first row, all join keys are take from the right-side and so are blanked out): {code} [Row(age=2, country=None, name=None, colour=None, country=None, name=None), Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'), Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', name=u'alice')] {code} I would expect to get (though ideally without duplicate columns): {code} [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, name=None), Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'), Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', name=u'alice')] {code} The workaround for now is this rather clunky piece of code: {code} df2 = 
sqlContext.createDataFrame(d2).withColumnRenamed('name', 'name2').withColumnRenamed('country', 'country2') df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 'left_outer').collect() {code} > dataframe left joins are not working as expected in pyspark > --- > > Key: SPARK-8685 > URL: https://issues.apache.org/jira/browse/SPARK-8685 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.4.0 > Environment: ubuntu 14.04 >Reporter: axel dahl >Assignee: Davies Liu > > I have the following code: > {code} > from pyspark import SQLContext > d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, > {'name':'alice', 'country': 'jpn', 'age': 2}, > {'name':'carol', 'country': 'ire', 'age': 3}] > d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, > {'name':'carol', 'country': 'ire', 'colour':'green'}] > r1 = sc.parallelize(d1) > r2 = sc.parallelize(d2) > sqlContext = SQLContext(sc) > df1 = sqlContext.createDataFrame(d1) > df2 = sqlContext.createDataFrame(d2) > df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), > 'left_outer').collect() > {code}
[jira] [Created] (SPARK-8804) order of UTF8String is wrong if there is any non-ascii character in it
Davies Liu created SPARK-8804: - Summary: order of UTF8String is wrong if there is any non-ascii character in it Key: SPARK-8804 URL: https://issues.apache.org/jira/browse/SPARK-8804 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker We compare UTF8Strings byte by byte, but bytes in the JVM are signed; they should be compared as unsigned. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
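The signed-vs-unsigned comparison gap described in SPARK-8804 above, shown in isolation (this is just the arithmetic, not the actual UTF8String patch):

{code}
val a: Byte = 0xe4.toByte   // first byte of a non-ASCII UTF-8 character, e.g. '中' (E4 B8 AD)
val b: Byte = 0x41          // 'A'

val signedCmp   = java.lang.Byte.compare(a, b)   // negative: 0xe4 is -28 as a signed byte
val unsignedCmp = (a & 0xff) - (b & 0xff)        // positive: 0xe4 (228) > 0x41 (65), the correct byte-wise order
{code}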
[jira] [Updated] (SPARK-8685) dataframe left joins are not working as expected in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8685: Fix Version/s: (was: 1.4.1) (was: 1.5.0) > dataframe left joins are not working as expected in pyspark > --- > > Key: SPARK-8685 > URL: https://issues.apache.org/jira/browse/SPARK-8685 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.4.0 > Environment: ubuntu 14.04 >Reporter: axel dahl >Assignee: Davies Liu > > I have the following code: > {code} > from pyspark import SQLContext > d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, > {'name':'alice', 'country': 'jpn', 'age': 2}, > {'name':'carol', 'country': 'ire', 'age': 3}] > d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, > {'name':'carol', 'country': 'ire', 'colour':'green'}] > r1 = sc.parallelize(d1) > r2 = sc.parallelize(d2) > sqlContext = SQLContext(sc) > df1 = sqlContext.createDataFrame(d1) > df2 = sqlContext.createDataFrame(d2) > df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), > 'left_outer').collect() > {code} > When I run it I get the following, (notice in the first row, all join keys > are take from the right-side and so are blanked out): > {code} > [Row(age=2, country=None, name=None, colour=None, country=None, name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > I would expect to get (though ideally without duplicate columns): > {code} > [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, > name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > The workaround for now is this rather clunky piece of code: > {code} > df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', > 'name2').withColumnRenamed('country', 'country2') > df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), > 'left_outer').collect() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8685) dataframe left joins are not working as expected in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8685: --- Priority: Major (was: Critical) > dataframe left joins are not working as expected in pyspark > --- > > Key: SPARK-8685 > URL: https://issues.apache.org/jira/browse/SPARK-8685 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.4.0 > Environment: ubuntu 14.04 >Reporter: axel dahl >Assignee: Davies Liu > Fix For: 1.4.1, 1.5.0 > > > I have the following code: > {code} > from pyspark import SQLContext > d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, > {'name':'alice', 'country': 'jpn', 'age': 2}, > {'name':'carol', 'country': 'ire', 'age': 3}] > d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, > {'name':'carol', 'country': 'ire', 'colour':'green'}] > r1 = sc.parallelize(d1) > r2 = sc.parallelize(d2) > sqlContext = SQLContext(sc) > df1 = sqlContext.createDataFrame(d1) > df2 = sqlContext.createDataFrame(d2) > df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), > 'left_outer').collect() > {code} > When I run it I get the following, (notice in the first row, all join keys > are take from the right-side and so are blanked out): > {code} > [Row(age=2, country=None, name=None, colour=None, country=None, name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > I would expect to get (though ideally without duplicate columns): > {code} > [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, > name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > The workaround for now is this rather clunky piece of code: > {code} > df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', > 'name2').withColumnRenamed('country', 'country2') > df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), > 'left_outer').collect() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8685) dataframe left joins are not working as expected in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8685: --- Target Version/s: 1.5.0 (was: 1.5.0, 1.4.2) > dataframe left joins are not working as expected in pyspark > --- > > Key: SPARK-8685 > URL: https://issues.apache.org/jira/browse/SPARK-8685 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.4.0 > Environment: ubuntu 14.04 >Reporter: axel dahl >Assignee: Davies Liu >Priority: Critical > Fix For: 1.4.1, 1.5.0 > > > I have the following code: > {code} > from pyspark import SQLContext > d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, > {'name':'alice', 'country': 'jpn', 'age': 2}, > {'name':'carol', 'country': 'ire', 'age': 3}] > d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, > {'name':'carol', 'country': 'ire', 'colour':'green'}] > r1 = sc.parallelize(d1) > r2 = sc.parallelize(d2) > sqlContext = SQLContext(sc) > df1 = sqlContext.createDataFrame(d1) > df2 = sqlContext.createDataFrame(d2) > df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), > 'left_outer').collect() > {code} > When I run it I get the following, (notice in the first row, all join keys > are take from the right-side and so are blanked out): > {code} > [Row(age=2, country=None, name=None, colour=None, country=None, name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > I would expect to get (though ideally without duplicate columns): > {code} > [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, > name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > The workaround for now is this rather clunky piece of code: > {code} > df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', > 'name2').withColumnRenamed('country', 'country2') > df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), > 'left_outer').collect() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8803) Crosstab elements can't contain nulls and back ticks
Burak Yavuz created SPARK-8803: -- Summary: Crosstab elements can't contain nulls and back ticks Key: SPARK-8803 URL: https://issues.apache.org/jira/browse/SPARK-8803 Project: Spark Issue Type: Bug Components: SQL Reporter: Burak Yavuz Having back ticks or nulls as elements causes problems. Since the elements become column names, back ticks have to be dropped from the element values because they are special characters. Nulls throw exceptions; we could replace them with empty strings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
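A hedged workaround sketch for the limitation described above, not the eventual fix: sanitize the element column before calling {{crosstab}}, mapping nulls to empty strings and stripping back ticks. The column names and sample data are made up, and a running pyspark shell with {{sc}} defined is assumed.
{code}
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

sqlContext = SQLContext(sc)  # assumes the pyspark shell's SparkContext `sc`
df = sqlContext.createDataFrame([("a`b", "x"), (None, "y"), ("c", "x")],
                                ["k", "v"])

# Map null to an empty string and strip back ticks so that the element values
# are safe to use as column names in the crosstab result.
sanitize = F.udf(lambda s: (s or "").replace("`", ""), StringType())

df.withColumn("k_clean", sanitize(df["k"])).stat.crosstab("k_clean", "v").show()
{code}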
[jira] [Updated] (SPARK-8685) dataframe left joins are not working as expected in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8685: Description: I have the following code: {code} from pyspark import SQLContext d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice', 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}] d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'carol', 'country': 'ire', 'colour':'green'}] r1 = sc.parallelize(d1) r2 = sc.parallelize(d2) sqlContext = SQLContext(sc) df1 = sqlContext.createDataFrame(d1) df2 = sqlContext.createDataFrame(d2) df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 'left_outer').collect() {code} When I run it I get the following, (notice in the first row, all join keys are take from the right-side and so are blanked out): {code} [Row(age=2, country=None, name=None, colour=None, country=None, name=None), Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'), Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', name=u'alice')] {code} I would expect to get (though ideally without duplicate columns): {code} [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, name=None), Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'), Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', name=u'alice')] {code} The workaround for now is this rather clunky piece of code: {code} df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 'name2').withColumnRenamed('country', 'country2') df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 'left_outer').collect() {code} was: I have the following code: {code} from pyspark import SQLContext d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice', 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}] d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'carol', 'country': 'ire', 'colour':'green'}] r1 = sc.parallelize(d1) r2 = sc.parallelize(d2) sqlContext = SQLContext(sc) df1 = sqlContext.createDataFrame(d1) df2 = sqlContext.createDataFrame(d2) df1.join(df2, df1.name == df2.name and df1.country == df2.country, 'left_outer').collect() {code} When I run it I get the following, (notice in the first row, all join keys are take from the right-side and so are blanked out): {code} [Row(age=2, country=None, name=None, colour=None, country=None, name=None), Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'), Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', name=u'alice')] {code} I would expect to get (though ideally without duplicate columns): {code} [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, name=None), Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'), Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', name=u'alice')] {code} The workaround for now is this rather clunky piece of code: {code} df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 'name2').withColumnRenamed('country', 'country2') df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2, 'left_outer').collect() {code} > dataframe left joins are not working as expected in pyspark > --- > > Key: SPARK-8685 > URL: https://issues.apache.org/jira/browse/SPARK-8685 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects 
Versions: 1.4.0 > Environment: ubuntu 14.04 >Reporter: axel dahl >Assignee: Davies Liu >Priority: Critical > Fix For: 1.4.1, 1.5.0 > > > I have the following code: > {code} > from pyspark import SQLContext > d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, > {'name':'alice', 'country': 'jpn', 'age': 2}, > {'name':'carol', 'country': 'ire', 'age': 3}] > d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, > {'name':'carol', 'country': 'ire', 'colour':'green'}] > r1 = sc.parallelize(d1) > r2 = sc.parallelize(d2) > sqlContext = SQLContext(sc) > df1 = sqlContext.createDataFrame(d1) > df2 = sqlContext.createDataFrame(d2) > df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), > 'left_outer').collect() > {code} > When I run it I get the following, (notice in the first row, all join keys > are take from the right-side and so are blanked out): > {code} > [Row(age=2, country=None, name=None, colour=None, country=None, name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > I would expect to get (though ideally without duplicate columns): > {code} > [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, > name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > The workaround for now is this rather clunky piece of code: > {code} > df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', > 'name2').withColumnRenamed('country', 'country2') > df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), > 'left_outer').collect() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-8685) dataframe left joins are not working as expected in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reopened SPARK-8685: - > dataframe left joins are not working as expected in pyspark > --- > > Key: SPARK-8685 > URL: https://issues.apache.org/jira/browse/SPARK-8685 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.4.0 > Environment: ubuntu 14.04 >Reporter: axel dahl >Assignee: Davies Liu >Priority: Critical > Fix For: 1.4.1, 1.5.0 > > > I have the following code: > {code} > from pyspark import SQLContext > d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, > {'name':'alice', 'country': 'jpn', 'age': 2}, > {'name':'carol', 'country': 'ire', 'age': 3}] > d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, > {'name':'carol', 'country': 'ire', 'colour':'green'}] > r1 = sc.parallelize(d1) > r2 = sc.parallelize(d2) > sqlContext = SQLContext(sc) > df1 = sqlContext.createDataFrame(d1) > df2 = sqlContext.createDataFrame(d2) > df1.join(df2, df1.name == df2.name and df1.country == df2.country, > 'left_outer').collect() > {code} > When I run it I get the following, (notice in the first row, all join keys > are take from the right-side and so are blanked out): > {code} > [Row(age=2, country=None, name=None, colour=None, country=None, name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > I would expect to get (though ideally without duplicate columns): > {code} > [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, > name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > The workaround for now is this rather clunky piece of code: > {code} > df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', > 'name2').withColumnRenamed('country', 'country2') > df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2, > 'left_outer').collect() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery
[ https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8501: -- Target Version/s: 1.4.1, 1.5.0 (was: 1.5.0, 1.4.2) > ORC data source may give empty schema if an ORC file containing zero rows is > picked for schema discovery > > > Key: SPARK-8501 > URL: https://issues.apache.org/jira/browse/SPARK-8501 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Hive 0.13.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > > Not sure whether this should be considered as a bug of ORC bundled with Hive > 0.13.1: for an ORC file containing zero rows, the schema written in its > footer contains zero fields (e.g. {{struct<>}}). > To reproduce this issue, let's first produce an empty ORC file. Copy data > file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to > {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the > following lines in Hive 0.13.1 CLI: > {noformat} > $ hive > hive> CREATE TABLE foo(key INT, value STRING); > hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo; > hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1; > {noformat} > Now inspect the empty ORC file we just wrote: > {noformat} > $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0 > Structure for /user/hive/warehouse_hive13/bar/00_0 > 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from > /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, > length: 9223372036854775807} > Rows: 0 > Compression: ZLIB > Compression size: 262144 > Type: struct<> > Stripe Statistics: > File Statistics: > Column 0: count: 0 > Stripes: > {noformat} > Notice the {{struct<>}} part. > This "feature" is OK for Hive, which has a central metastore to save table > schema. But for users who read raw data files without Hive metastore with > Spark SQL 1.4.0, it causes problem because currently the ORC data source just > picks a random part-file whichever comes the first for schema discovery. > Expected behavior can be: > # Try all files one by one until we find a part-file with non-empty schema. > # Throws {{AnalysisException}} if no such part-file can be found. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
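To make the expected behavior above concrete, here is a minimal, hedged sketch of the fallback loop in plain Python (not the actual data source code, which is Scala): the footer reads are faked with a dict, and an empty field list stands in for the {{struct<>}} schema.
{code}
class AnalysisException(Exception):
    """Stand-in for Spark SQL's AnalysisException."""


def discover_schema(part_files, footer_schemas):
    # Try the part-files one by one and return the first non-empty schema,
    # instead of trusting whichever file happens to be listed first.
    for path in part_files:
        fields = footer_schemas.get(path, [])  # [] models an empty struct<>
        if fields:
            return fields
    raise AnalysisException("all ORC part-files have an empty schema")


# The first part-file was written with zero rows, so the second one wins.
schemas = {"part-00000": [],
           "part-00001": [("key", "int"), ("value", "string")]}
print(discover_schema(["part-00000", "part-00001"], schemas))
{code}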
[jira] [Closed] (SPARK-8685) dataframe left joins are not working as expected in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-8685. -- Resolution: Duplicate Assignee: Davies Liu Fix Version/s: 1.5.0 1.4.1 This is already fixed. > dataframe left joins are not working as expected in pyspark > --- > > Key: SPARK-8685 > URL: https://issues.apache.org/jira/browse/SPARK-8685 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.4.0 > Environment: ubuntu 14.04 >Reporter: axel dahl >Assignee: Davies Liu >Priority: Critical > Fix For: 1.4.1, 1.5.0 > > > I have the following code: > {code} > from pyspark import SQLContext > d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, > {'name':'alice', 'country': 'jpn', 'age': 2}, > {'name':'carol', 'country': 'ire', 'age': 3}] > d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, > {'name':'carol', 'country': 'ire', 'colour':'green'}] > r1 = sc.parallelize(d1) > r2 = sc.parallelize(d2) > sqlContext = SQLContext(sc) > df1 = sqlContext.createDataFrame(d1) > df2 = sqlContext.createDataFrame(d2) > df1.join(df2, df1.name == df2.name and df1.country == df2.country, > 'left_outer').collect() > {code} > When I run it I get the following, (notice in the first row, all join keys > are take from the right-side and so are blanked out): > {code} > [Row(age=2, country=None, name=None, colour=None, country=None, name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > I would expect to get (though ideally without duplicate columns): > {code} > [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, > name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > The workaround for now is this rather clunky piece of code: > {code} > df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', > 'name2').withColumnRenamed('country', 'country2') > df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2, > 'left_outer').collect() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8573) For PySpark's DataFrame API, we need to throw exceptions when users try to use and/or/not
[ https://issues.apache.org/jira/browse/SPARK-8573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8573: --- Fix Version/s: 1.5.0 1.4.1 > For PySpark's DataFrame API, we need to throw exceptions when users try to > use and/or/not > - > > Key: SPARK-8573 > URL: https://issues.apache.org/jira/browse/SPARK-8573 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Davies Liu >Priority: Critical > Fix For: 1.4.1, 1.5.0 > > > In PySpark's DataFrame API, we have > {code} > # `and`, `or`, `not` cannot be overloaded in Python, > # so use bitwise operators as boolean operators > __and__ = _bin_op('and') > __or__ = _bin_op('or') > __invert__ = _func_op('not') > __rand__ = _bin_op("and") > __ror__ = _bin_op("or") > {code} > Right now, users can still use operators like {{and}}, which can cause very > confusing behaviors. We need to throw an error when users try to use them and > let them know what is the right way to do. > For example, > {code} > df = sqlContext.range(1, 10) > df.id > 5 or df.id < 10 > Out[30]: Column<(id > 5)> > df.id > 5 and df.id < 10 > Out[31]: Column<(id < 10)> > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
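The example above shows the root cause: Python's {{and}}/{{or}} evaluate their operands through truth testing and simply return one of them, so {{df.id > 5 and df.id < 10}} silently collapses to {{df.id < 10}}. Below is a minimal, hedged sketch of the kind of guard the ticket asks for, using a stand-in class rather than the real {{pyspark.sql.Column}} source; making truth-value conversion raise turns the silent mis-evaluation into an explicit error that points users at {{&}}, {{|}} and {{~}}.
{code}
class FakeColumn(object):
    """Toy stand-in for a SQL column expression, not pyspark.sql.Column."""

    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        return FakeColumn("(%s AND %s)" % (self.expr, other.expr))

    def __or__(self, other):
        return FakeColumn("(%s OR %s)" % (self.expr, other.expr))

    def __invert__(self):
        return FakeColumn("(NOT %s)" % self.expr)

    def __nonzero__(self):  # Python 2 truth-testing hook
        raise ValueError("Cannot convert column into bool: use '&' for 'and', "
                         "'|' for 'or', '~' for 'not' when building expressions.")

    __bool__ = __nonzero__  # Python 3 uses __bool__ for the same hook


a, b = FakeColumn("id > 5"), FakeColumn("id < 10")
print((a & b).expr)  # (id > 5 AND id < 10)
try:
    a and b          # `and` calls bool(a) first, which now raises
except ValueError as e:
    print(e)
{code}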
[jira] [Commented] (SPARK-8685) dataframe left joins are not working as expected in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612626#comment-14612626 ] Michael Armbrust commented on SPARK-8685: - I think the problem here is that you are using {{and}}, but instead should write {{(df1.name == df2.name) & (df1.country == df2.country)}} > dataframe left joins are not working as expected in pyspark > --- > > Key: SPARK-8685 > URL: https://issues.apache.org/jira/browse/SPARK-8685 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.4.0 > Environment: ubuntu 14.04 >Reporter: axel dahl >Priority: Critical > > I have the following code: > {code} > from pyspark import SQLContext > d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, > {'name':'alice', 'country': 'jpn', 'age': 2}, > {'name':'carol', 'country': 'ire', 'age': 3}] > d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, > {'name':'carol', 'country': 'ire', 'colour':'green'}] > r1 = sc.parallelize(d1) > r2 = sc.parallelize(d2) > sqlContext = SQLContext(sc) > df1 = sqlContext.createDataFrame(d1) > df2 = sqlContext.createDataFrame(d2) > df1.join(df2, df1.name == df2.name and df1.country == df2.country, > 'left_outer').collect() > {code} > When I run it I get the following, (notice in the first row, all join keys > are take from the right-side and so are blanked out): > {code} > [Row(age=2, country=None, name=None, colour=None, country=None, name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > I would expect to get (though ideally without duplicate columns): > {code} > [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, > name=None), > Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', > name=u'bob'), > Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', > name=u'alice')] > {code} > The workaround for now is this rather clunky piece of code: > {code} > df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', > 'name2').withColumnRenamed('country', 'country2') > df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2, > 'left_outer').collect() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
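For reference, a short sketch of the corrected call (assuming a pyspark shell where {{sc}} is defined; data trimmed from the report): the comparisons are combined with the bitwise {{&}} operator, and the parentheses are required because {{&}} binds more tightly than {{==}}.
{code}
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes the pyspark shell's SparkContext `sc`

df1 = sqlContext.createDataFrame([{'name': 'bob', 'country': 'usa', 'age': 1},
                                  {'name': 'alice', 'country': 'jpn', 'age': 2}])
df2 = sqlContext.createDataFrame([{'name': 'bob', 'country': 'usa', 'colour': 'red'}])

# Wrap each comparison in parentheses before combining with `&`.
cond = (df1.name == df2.name) & (df1.country == df2.country)
print(df1.join(df2, cond, 'left_outer').collect())
{code}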
[jira] [Commented] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery
[ https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612624#comment-14612624 ] Apache Spark commented on SPARK-8501: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/7200 > ORC data source may give empty schema if an ORC file containing zero rows is > picked for schema discovery > > > Key: SPARK-8501 > URL: https://issues.apache.org/jira/browse/SPARK-8501 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Hive 0.13.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > > Not sure whether this should be considered as a bug of ORC bundled with Hive > 0.13.1: for an ORC file containing zero rows, the schema written in its > footer contains zero fields (e.g. {{struct<>}}). > To reproduce this issue, let's first produce an empty ORC file. Copy data > file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to > {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the > following lines in Hive 0.13.1 CLI: > {noformat} > $ hive > hive> CREATE TABLE foo(key INT, value STRING); > hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo; > hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1; > {noformat} > Now inspect the empty ORC file we just wrote: > {noformat} > $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0 > Structure for /user/hive/warehouse_hive13/bar/00_0 > 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from > /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, > length: 9223372036854775807} > Rows: 0 > Compression: ZLIB > Compression size: 262144 > Type: struct<> > Stripe Statistics: > File Statistics: > Column 0: count: 0 > Stripes: > {noformat} > Notice the {{struct<>}} part. > This "feature" is OK for Hive, which has a central metastore to save table > schema. But for users who read raw data files without Hive metastore with > Spark SQL 1.4.0, it causes problem because currently the ORC data source just > picks a random part-file whichever comes the first for schema discovery. > Expected behavior can be: > # Try all files one by one until we find a part-file with non-empty schema. > # Throws {{AnalysisException}} if no such part-file can be found. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612608#comment-14612608 ] Justin Uang commented on SPARK-8632: Haven't gotten around to it yet. I'll let you know when I can find time to work on it! Right now, I'm thinking about creating a separate code path for sql udfs, since I realized that the current system with two threads is necessary because the RDD interface is from Iterator -> Iterator. Any type of synchronous batching won't work with RDDs that change the length of the output iterator. > Poor Python UDF performance because of RDD caching > -- > > Key: SPARK-8632 > URL: https://issues.apache.org/jira/browse/SPARK-8632 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Justin Uang > > {quote} > We have been running into performance problems using Python UDFs with > DataFrames at large scale. > From the implementation of BatchPythonEvaluation, it looks like the goal was > to reuse the PythonRDD code. It caches the entire child RDD so that it can do > two passes over the data. One to give to the PythonRDD, then one to join the > python lambda results with the original row (which may have java objects that > should be passed through). > In addition, it caches all the columns, even the ones that don't need to be > processed by the Python UDF. In the cases I was working with, I had a 500 > column table, and i wanted to use a python UDF for one column, and it ended > up caching all 500 columns. > {quote} > http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
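Until BatchPythonEvaluation is reworked, one hedged mitigation for the 500-column case described in the ticket is to apply the UDF to a narrow projection rather than the full table, so only the needed columns are shipped to the Python worker and cached; the result can be joined back to the wide table by key if the remaining columns are needed. The column names, sample data, and UDF below are illustrative only, and a pyspark shell with {{sc}} defined is assumed.
{code}
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

sqlContext = SQLContext(sc)  # assumes the pyspark shell's SparkContext `sc`

# A three-column DataFrame stands in for a much wider table.
wide = sqlContext.createDataFrame([(1, "a", "x"), (2, "b", "y")],
                                  ["id", "text", "other"])

shout = F.udf(lambda s: s.upper(), StringType())  # illustrative UDF

# Run the UDF on a narrow projection so the cache and the Python worker only
# see the key column and the UDF input, not every column of the wide table.
narrow = wide.select("id", "text")
result = narrow.withColumn("text_upper", shout(F.col("text")))
result.show()
{code}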
[jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8632: Target Version/s: 1.5.0 > Poor Python UDF performance because of RDD caching > -- > > Key: SPARK-8632 > URL: https://issues.apache.org/jira/browse/SPARK-8632 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Justin Uang > > {quote} > We have been running into performance problems using Python UDFs with > DataFrames at large scale. > From the implementation of BatchPythonEvaluation, it looks like the goal was > to reuse the PythonRDD code. It caches the entire child RDD so that it can do > two passes over the data. One to give to the PythonRDD, then one to join the > python lambda results with the original row (which may have java objects that > should be passed through). > In addition, it caches all the columns, even the ones that don't need to be > processed by the Python UDF. In the cases I was working with, I had a 500 > column table, and i wanted to use a python UDF for one column, and it ended > up caching all 500 columns. > {quote} > http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8632: Shepherd: Davies Liu > Poor Python UDF performance because of RDD caching > -- > > Key: SPARK-8632 > URL: https://issues.apache.org/jira/browse/SPARK-8632 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0 >Reporter: Justin Uang > > {quote} > We have been running into performance problems using Python UDFs with > DataFrames at large scale. > From the implementation of BatchPythonEvaluation, it looks like the goal was > to reuse the PythonRDD code. It caches the entire child RDD so that it can do > two passes over the data. One to give to the PythonRDD, then one to join the > python lambda results with the original row (which may have java objects that > should be passed through). > In addition, it caches all the columns, even the ones that don't need to be > processed by the Python UDF. In the cases I was working with, I had a 500 > column table, and i wanted to use a python UDF for one column, and it ended > up caching all 500 columns. > {quote} > http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery
[ https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8501: --- Assignee: Cheng Lian (was: Apache Spark) > ORC data source may give empty schema if an ORC file containing zero rows is > picked for schema discovery > > > Key: SPARK-8501 > URL: https://issues.apache.org/jira/browse/SPARK-8501 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Hive 0.13.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > > Not sure whether this should be considered as a bug of ORC bundled with Hive > 0.13.1: for an ORC file containing zero rows, the schema written in its > footer contains zero fields (e.g. {{struct<>}}). > To reproduce this issue, let's first produce an empty ORC file. Copy data > file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to > {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the > following lines in Hive 0.13.1 CLI: > {noformat} > $ hive > hive> CREATE TABLE foo(key INT, value STRING); > hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo; > hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1; > {noformat} > Now inspect the empty ORC file we just wrote: > {noformat} > $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0 > Structure for /user/hive/warehouse_hive13/bar/00_0 > 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from > /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, > length: 9223372036854775807} > Rows: 0 > Compression: ZLIB > Compression size: 262144 > Type: struct<> > Stripe Statistics: > File Statistics: > Column 0: count: 0 > Stripes: > {noformat} > Notice the {{struct<>}} part. > This "feature" is OK for Hive, which has a central metastore to save table > schema. But for users who read raw data files without Hive metastore with > Spark SQL 1.4.0, it causes problem because currently the ORC data source just > picks a random part-file whichever comes the first for schema discovery. > Expected behavior can be: > # Try all files one by one until we find a part-file with non-empty schema. > # Throws {{AnalysisException}} if no such part-file can be found. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery
[ https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612599#comment-14612599 ] Apache Spark commented on SPARK-8501: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/7199 > ORC data source may give empty schema if an ORC file containing zero rows is > picked for schema discovery > > > Key: SPARK-8501 > URL: https://issues.apache.org/jira/browse/SPARK-8501 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Hive 0.13.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > > Not sure whether this should be considered as a bug of ORC bundled with Hive > 0.13.1: for an ORC file containing zero rows, the schema written in its > footer contains zero fields (e.g. {{struct<>}}). > To reproduce this issue, let's first produce an empty ORC file. Copy data > file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to > {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the > following lines in Hive 0.13.1 CLI: > {noformat} > $ hive > hive> CREATE TABLE foo(key INT, value STRING); > hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo; > hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1; > {noformat} > Now inspect the empty ORC file we just wrote: > {noformat} > $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0 > Structure for /user/hive/warehouse_hive13/bar/00_0 > 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from > /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, > length: 9223372036854775807} > Rows: 0 > Compression: ZLIB > Compression size: 262144 > Type: struct<> > Stripe Statistics: > File Statistics: > Column 0: count: 0 > Stripes: > {noformat} > Notice the {{struct<>}} part. > This "feature" is OK for Hive, which has a central metastore to save table > schema. But for users who read raw data files without Hive metastore with > Spark SQL 1.4.0, it causes problem because currently the ORC data source just > picks a random part-file whichever comes the first for schema discovery. > Expected behavior can be: > # Try all files one by one until we find a part-file with non-empty schema. > # Throws {{AnalysisException}} if no such part-file can be found. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3382) GradientDescent convergence tolerance
[ https://issues.apache.org/jira/browse/SPARK-3382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3382: - Assignee: Kai Sasaki > GradientDescent convergence tolerance > - > > Key: SPARK-3382 > URL: https://issues.apache.org/jira/browse/SPARK-3382 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Assignee: Kai Sasaki >Priority: Minor > Fix For: 1.5.0 > > > GradientDescent should support a convergence tolerance setting. In general, > for optimization, convergence tolerance should be preferred over a limit on > the number of iterations since it is a somewhat data-adaptive or > data-specific convergence criterion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery
[ https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8501: --- Assignee: Apache Spark (was: Cheng Lian) > ORC data source may give empty schema if an ORC file containing zero rows is > picked for schema discovery > > > Key: SPARK-8501 > URL: https://issues.apache.org/jira/browse/SPARK-8501 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Hive 0.13.1 >Reporter: Cheng Lian >Assignee: Apache Spark >Priority: Critical > > Not sure whether this should be considered as a bug of ORC bundled with Hive > 0.13.1: for an ORC file containing zero rows, the schema written in its > footer contains zero fields (e.g. {{struct<>}}). > To reproduce this issue, let's first produce an empty ORC file. Copy data > file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to > {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the > following lines in Hive 0.13.1 CLI: > {noformat} > $ hive > hive> CREATE TABLE foo(key INT, value STRING); > hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo; > hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1; > {noformat} > Now inspect the empty ORC file we just wrote: > {noformat} > $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0 > Structure for /user/hive/warehouse_hive13/bar/00_0 > 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from > /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, > length: 9223372036854775807} > Rows: 0 > Compression: ZLIB > Compression size: 262144 > Type: struct<> > Stripe Statistics: > File Statistics: > Column 0: count: 0 > Stripes: > {noformat} > Notice the {{struct<>}} part. > This "feature" is OK for Hive, which has a central metastore to save table > schema. But for users who read raw data files without Hive metastore with > Spark SQL 1.4.0, it causes problem because currently the ORC data source just > picks a random part-file whichever comes the first for schema discovery. > Expected behavior can be: > # Try all files one by one until we find a part-file with non-empty schema. > # Throws {{AnalysisException}} if no such part-file can be found. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3382) GradientDescent convergence tolerance
[ https://issues.apache.org/jira/browse/SPARK-3382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-3382. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 3636 [https://github.com/apache/spark/pull/3636] > GradientDescent convergence tolerance > - > > Key: SPARK-3382 > URL: https://issues.apache.org/jira/browse/SPARK-3382 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > Fix For: 1.5.0 > > > GradientDescent should support a convergence tolerance setting. In general, > for optimization, convergence tolerance should be preferred over a limit on > the number of iterations since it is a somewhat data-adaptive or > data-specific convergence criterion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
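To make the criterion concrete, here is a small hedged sketch in plain NumPy (outside MLlib, not the GradientDescent implementation) of stopping once the relative change of the weight vector falls below a tolerance instead of always running a fixed number of iterations; the step size, tolerance, and objective are made up.
{code}
import numpy as np


def gradient_descent(grad, w0, step=0.1, max_iter=100, tol=1e-3):
    """Stop early once the relative change in the weights drops below tol."""
    w = np.asarray(w0, dtype=float)
    for i in range(max_iter):
        w_new = w - step * grad(w)
        if np.linalg.norm(w_new - w) / max(np.linalg.norm(w), 1.0) < tol:
            return w_new, i + 1
        w = w_new
    return w, max_iter


# Minimize f(w) = ||w - 3||^2, whose gradient is 2 * (w - 3).
w_opt, iters = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=[0.0])
print(w_opt, iters)  # converges near 3.0 well before the 100-iteration cap
{code}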
[jira] [Commented] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
[ https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612584#comment-14612584 ] Christian Kadner commented on SPARK-8746: - Thank you Sean! > Need to update download link for Hive 0.13.1 jars (HiveComparisonTest) > -- > > Key: SPARK-8746 > URL: https://issues.apache.org/jira/browse/SPARK-8746 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 >Reporter: Christian Kadner >Assignee: Christian Kadner >Priority: Trivial > Labels: documentation, test > Fix For: 1.4.1, 1.5.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) > describes how to generate golden answer files for new hive comparison test > cases. However the download link for the Hive 0.13.1 jars points to > https://hive.apache.org/downloads.html but none of the linked mirror sites > still has the 0.13.1 version. > We need to update the link to > https://archive.apache.org/dist/hive/hive-0.13.1/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8792) Add Python API for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8792: - Target Version/s: 1.5.0 > Add Python API for PCA transformer > -- > > Key: SPARK-8792 > URL: https://issues.apache.org/jira/browse/SPARK-8792 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 1.5.0 >Reporter: Yanbo Liang > > Add Python API for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
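For context on what the transformer computes, here is a hedged NumPy sketch of PCA (center the data, take the top-k right singular vectors, project); it is illustrative only and is not the {{ml.feature.PCA}} implementation or the proposed Python wrapper.
{code}
import numpy as np


def pca_fit_transform(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                       # center each feature
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:k]                           # shape (k, n_features)
    return Xc.dot(components.T), components


X = np.array([[2.0, 0.0], [0.0, 2.0], [4.0, 0.0], [0.0, 4.0]])
projected, comps = pca_fit_transform(X, k=1)
print(projected.ravel())                          # 1-D coordinate of each row
{code}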
[jira] [Commented] (SPARK-8780) Move Python doctest code example from models to algorithms
[ https://issues.apache.org/jira/browse/SPARK-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612566#comment-14612566 ] Joseph K. Bradley commented on SPARK-8780: -- Changing priority to minor for now. This would be nice, but there is a lot of other stuff to do first. > Move Python doctest code example from models to algorithms > -- > > Key: SPARK-8780 > URL: https://issues.apache.org/jira/browse/SPARK-8780 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 1.5.0 >Reporter: Yanbo Liang >Priority: Minor > > Almost all doctest code examples are in the models at Pyspark mllib. > Since users usually start with algorithms rather than models, we need to move > them from models to algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8780) Move Python doctest code example from models to algorithms
[ https://issues.apache.org/jira/browse/SPARK-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8780: - Priority: Minor (was: Major) > Move Python doctest code example from models to algorithms > -- > > Key: SPARK-8780 > URL: https://issues.apache.org/jira/browse/SPARK-8780 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 1.5.0 >Reporter: Yanbo Liang >Priority: Minor > > Almost all doctest code examples are in the models at Pyspark mllib. > Since users usually start with algorithms rather than models, we need to move > them from models to algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8780) Move Python doctest code example from models to algorithms
[ https://issues.apache.org/jira/browse/SPARK-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8780: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-6173 > Move Python doctest code example from models to algorithms > -- > > Key: SPARK-8780 > URL: https://issues.apache.org/jira/browse/SPARK-8780 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 1.5.0 >Reporter: Yanbo Liang > > Almost all doctest code examples are in the models at Pyspark mllib. > Since users usually start with algorithms rather than models, we need to move > them from models to algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
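A hedged illustration of what the move looks like in practice: the runnable doctest sits on the algorithm's {{train}} entry point, where users start, rather than on the model class it returns. The classes below are toys and are not MLlib code.
{code}
class ToyMeanModel(object):
    """Model returned by training; no usage doctest needed here."""

    def __init__(self, mean):
        self.mean = mean


class ToyMean(object):
    @classmethod
    def train(cls, values):
        """Fit the toy estimator.

        >>> ToyMean.train([1.0, 2.0, 3.0]).mean
        2.0
        """
        return ToyMeanModel(sum(values) / float(len(values)))


if __name__ == "__main__":
    import doctest
    doctest.testmod()
{code}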
[jira] [Commented] (SPARK-8802) Decimal.apply(BigDecimal).toBigDecimal may throw NumberFormatException
[ https://issues.apache.org/jira/browse/SPARK-8802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612547#comment-14612547 ] Apache Spark commented on SPARK-8802: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7198 > Decimal.apply(BigDecimal).toBigDecimal may throw NumberFormatException > -- > > Key: SPARK-8802 > URL: https://issues.apache.org/jira/browse/SPARK-8802 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > There exist certain BigDecimals that can be converted into Spark SQL's > Decimal class but which produce Decimals that cannot be converted back to > BigDecimal without throwing NumberFormatException. > For instance: > {code} > val x = BigDecimal(BigInt("18889465931478580854784"), -2147483648) > assert(Decimal(x).toBigDecimal === x) > {code} > will fail with an exception: > {code} > java.lang.NumberFormatException > at java.math.BigDecimal.(BigDecimal.java:511) > at java.math.BigDecimal.(BigDecimal.java:757) > at scala.math.BigDecimal$.apply(BigDecimal.scala:119) > at scala.math.BigDecimal.apply(BigDecimal.scala:324) > at org.apache.spark.sql.types.Decimal.toBigDecimal(Decimal.scala:142) > at > org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply$mcV$sp(DecimalSuite.scala:62) > at > org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply(DecimalSuite.scala:60) > at > org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply(DecimalSuite.scala:60) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org