[jira] [Commented] (SPARK-37800) TreeNode.argString incorrectly formats arguments of type Set[_]

2022-01-01 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17467526#comment-17467526
 ] 

Simeon Simeonov commented on SPARK-37800:
-

[~hyukjin.kwon] Done: https://github.com/apache/spark/pull/35084

> TreeNode.argString incorrectly formats arguments of type Set[_]
> ---
>
> Key: SPARK-37800
> URL: https://issues.apache.org/jira/browse/SPARK-37800
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Simeon Simeonov
>Priority: Minor
>
> The implementation of {{argString}} uses the following pattern for sets:
>  
> {code:java}
> case set: Set[_] =>
>   // Sort elements for deterministic behaviours
>   val sortedSeq = set.toSeq.map(formatArg(_, maxFields).sorted)   
>
>   truncatedString(sortedSeq, "{", ", ", "}", maxFields) :: Nil {code}
> Instead of sorting the elements of the set, the implementation sorts the 
> characters of the strings that {{formatArg}} returns. 
> The fix is simply to move the closing parenthesis to the correct location:
> {code:java}
>   val sortedSeq = set.toSeq.map(formatArg(_, maxFields)).sorted
> {code}
>  






[jira] [Created] (SPARK-37800) TreeNode.argString incorrectly formats arguments of type Set[_]

2022-01-01 Thread Simeon Simeonov (Jira)
Simeon Simeonov created SPARK-37800:
---

 Summary: TreeNode.argString incorrectly formats arguments of type 
Set[_]
 Key: SPARK-37800
 URL: https://issues.apache.org/jira/browse/SPARK-37800
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Simeon Simeonov


The implementation of {{argString}} uses the following pattern for sets:

 
{code:java}
case set: Set[_] =>
  // Sort elements for deterministic behaviours
  val sortedSeq = set.toSeq.map(formatArg(_, maxFields).sorted) 
 
  truncatedString(sortedSeq, "{", ", ", "}", maxFields) :: Nil {code}
Instead of sorting the elements of the set, the implementation sorts the 
characters of the strings that {{formatArg}} returns. 

The fix is simply to move the closing parenthesis to the correct location:
{code:java}
  val sortedSeq = set.toSeq.map(formatArg(_, maxFields)).sorted
{code}
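For illustration, a minimal, self-contained sketch of the difference (plain 
Scala, no Spark types; the hypothetical {{formatArg}} stub below just calls 
{{toString}}, unlike the real method, which also takes {{maxFields}}):
{code:scala}
def formatArg(arg: Any): String = arg.toString  // stand-in for the real helper

val set: Set[Any] = Set("banana", "apple")

// Misplaced parenthesis: .sorted is applied to each String, so the *characters*
// of every element get sorted, e.g. Seq("aaabnn", "aelpp").
val buggy = set.toSeq.map(formatArg(_).sorted)

// Correct placement: .sorted is applied to the Seq[String], so the *elements*
// get sorted deterministically: Seq("apple", "banana").
val fixed = set.toSeq.map(formatArg(_)).sorted
{code}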
 






[jira] [Comment Edited] (SPARK-35097) Add column name to SparkUpgradeException about ancient datetime

2021-04-18 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324643#comment-17324643
 ] 

Simeon Simeonov edited comment on SPARK-35097 at 4/19/21, 1:58 AM:
---

[~maxgekk] thanks for creating this issue; it came from a problem we 
discovered. 

Reporting the column name alone in the exception message would not be 
sufficient for fast root-cause analysis in many situations. Imagine a column 
named something like {{date}}, which is common across many tables, in a job 
that uses lots of different tables. To aid users, one needs to narrow down the 
source of the problem either by (a) providing user stack trace information or, 
if that is for some reason impossible or very difficult, by (b) providing 
information about the Parquet source with the issue (path, plan info, etc.).

[~angerszhuuu] can either (a) or (b) above be added to your PR?


was (Author: simeons):
[~maxgekk] thanks for creating this issue; it came from a problem we 
discovered. 

Reporting the column name alone in the exception message would not be 
sufficient for fast root cause analysis in many situations. Imagine the case 
where the column name is something like {{date}}, which is common across many 
tables, and a job that uses lots of different tables. To aid users, one needs 
to narrow down the source of the problem by either (a) providing user stack 
trace information or, if that is for some reason impossible or very difficult, 
(b) provide information about the Parquet parquet source with the issue (path, 
plan info, etc.).

[~angerszhuuu] can either (a) or (b) above be added to your PR?

> Add column name to SparkUpgradeException about ancient datetime
> ---
>
> Key: SPARK-35097
> URL: https://issues.apache.org/jira/browse/SPARK-35097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> The error message:
> {code:java}
> org.apache.spark.SparkUpgradeException: You may get a different result due to 
> the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps 
> before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files 
> may be written by Spark 2.x or legacy versions of Hive, which uses a legacy 
> hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
> calendar. See more details in SPARK-31404. You can set 
> spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
> datetime values w.r.t. the calendar difference during reading. Or set 
> spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
> datetime values as it is.
> {code}
> doesn't give any clue as to which column causes the issue. We need to improve 
> the message and add the column name to it.






[jira] [Commented] (SPARK-35097) Add column name to SparkUpgradeException about ancient datetime

2021-04-18 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324643#comment-17324643
 ] 

Simeon Simeonov commented on SPARK-35097:
-

[~maxgekk] thanks for creating this issue; it came from a problem we 
discovered. 

Reporting the column name alone in the exception message would not be 
sufficient for fast root-cause analysis in many situations. Imagine a column 
named something like {{date}}, which is common across many tables, in a job 
that uses lots of different tables. To aid users, one needs to narrow down the 
source of the problem either by (a) providing user stack trace information or, 
if that is for some reason impossible or very difficult, by (b) providing 
information about the Parquet source with the issue (path, plan info, etc.).

[~angerszhuuu] can either (a) or (b) above be added to your PR?

> Add column name to SparkUpgradeException about ancient datetime
> ---
>
> Key: SPARK-35097
> URL: https://issues.apache.org/jira/browse/SPARK-35097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> The error message:
> {code:java}
> org.apache.spark.SparkUpgradeException: You may get a different result due to 
> the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps 
> before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files 
> may be written by Spark 2.x or legacy versions of Hive, which uses a legacy 
> hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
> calendar. See more details in SPARK-31404. You can set 
> spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
> datetime values w.r.t. the calendar difference during reading. Or set 
> spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
> datetime values as it is.
> {code}
> doesn't give any clue as to which column causes the issue. We need to improve 
> the message and add the column name to it.






[jira] [Commented] (SPARK-27790) Support ANSI SQL INTERVAL types

2021-03-18 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304526#comment-17304526
 ] 

Simeon Simeonov commented on SPARK-27790:
-

Maxim, this is good stuff. 

Does ANSI SQL allow operations on dates using the YEAR-MONTH interval type? I 
didn't see that mentioned in Milestone 1.
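For reference, a sketch of the kind of operation I mean, in Spark SQL syntax 
(assuming a Spark version where the year-month interval work from Milestone 1 
has landed; {{spark}} is an existing session):
{code:scala}
// Adding an ANSI year-month interval to a date.
spark.sql("SELECT DATE'2021-01-31' + INTERVAL '1' MONTH AS d").show()
// Expected: 2021-02-28 (the day is clamped to the end of the shorter month)
{code}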

> Support ANSI SQL INTERVAL types
> ---
>
> Key: SPARK-27790
> URL: https://issues.apache.org/jira/browse/SPARK-27790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Spark has an INTERVAL data type, but it is “broken”:
> # It cannot be persisted
> # It is not comparable because it crosses the month/day line. That is, there 
> is no telling whether “1 Month 1 Day” is equal to “1 Month 1 Day”, since not 
> all months have the same number of days.
> I propose here to introduce the two flavors of INTERVAL described in the 
> ANSI SQL standard and to deprecate Spark's interval type.
> * ANSI describes two non-overlapping “classes”: 
> ** YEAR-MONTH ranges
> ** DAY-SECOND ranges
> * Members within each class can be compared and sorted.
> * Both support datetime arithmetic.
> * Both can be persisted.
> The old and new flavors of INTERVAL can coexist until Spark INTERVAL is 
> eventually retired. Also, any semantic “breakage” can be controlled via legacy 
> config settings. 
> *Milestone 1* -- Spark interval equivalency (the new interval types meet 
> or exceed all functionality of the existing SQL interval):
> * Add two new DataType implementations for interval year-month and 
> day-second, including the JSON format and DDL string.
> * Infra support: check the caller sides of DateType/TimestampType.
> * Support the two new interval types in Dataset/UDF.
> * Interval literals (with a legacy config to still allow mixed year-month 
> day-second fields and return legacy interval values)
> * Interval arithmetic (interval * num, interval / num, interval +/- interval)
> * Datetime functions/operators: Datetime - Datetime (to days or day-second), 
> Datetime +/- interval
> * Cast to and from the two new interval types, cast string to interval, cast 
> interval to string (pretty printing), with the SQL syntax to specify the types
> * Support sorting intervals.
> *Milestone 2* -- Persistence:
> * Ability to create tables of type interval
> * Ability to write to common file formats such as Parquet and JSON.
> * INSERT, SELECT, UPDATE, MERGE
> * Discovery
> *Milestone 3* --  Client support
> * JDBC support
> * Hive Thrift server
> *Milestone 4* -- PySpark and Spark R integration
> * Python UDF can take and return intervals
> * DataFrame support






[jira] [Commented] (SPARK-32630) Reduce user confusion and subtle bugs by optionally preventing date & timestamp comparison

2020-08-16 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178624#comment-17178624
 ] 

Simeon Simeonov commented on SPARK-32630:
-

[~rxin] FYI, this is one of the subtle issues that add friction for new users 
trying to be safely productive on the platform.

> Reduce user confusion and subtle bugs by optionally preventing date & 
> timestamp comparison
> --
>
> Key: SPARK-32630
> URL: https://issues.apache.org/jira/browse/SPARK-32630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: comparison, sql, timestamps
>
> https://issues.apache.org/jira/browse/SPARK-23549 made Spark's handling of 
> date vs. timestamp comparison consistent with SQL, which, unfortunately, 
> isn't consistent with common sense.
> When dates are compared with timestamps, they are promoted to timestamps at 
> midnight of the date, in the server timezone, which is almost always UTC. 
> This only works well if all timestamps in the data are logically time 
> instants as opposed to dates + times, which only become instants with a known 
> timezone.
> The fundamental issue is that dates are a human time concept and instants are 
> a machine time concept. While we can technically promote one to the other, 
> logically, it only works 100% if midnight for all dates in the system is in 
> the server timezone. 
> Every major modern platform offers a clear distinction between machine time 
> (instants) and human time (an instant with a timezone, UTC offset, etc.), 
> because we have learned the hard way that date & time handling is a 
> never-ending source of confusion and bugs. SQL, being an ancient language 
> (40+ years old), is well behind software engineering best practices; using it 
> as a guiding light is necessary for Spark to win market share, but 
> unfortunate in every other way.
> For example, Java has:
>  * java.time.LocalDate
>  * java.time.Instant
>  * java.time.ZonedDateTime
>  * java.time.OffsetDateTime
> I am not suggesting we add new data types to Spark. I am suggesting we go to 
> the heart of the matter, which is that most date vs. time handling issues are 
> the result of confusion or carelessness.
> What about introducing a new setting that makes comparisons between dates and 
> timestamps illegal, preferably with a helpful exception message?
> If it existed, I would certainly make it the default for all our clusters. 
> The minor coding convenience that comes from being able to compare dates & 
> timestamps with an automatic type promotion pales in comparison with the risk 
> of subtle bugs that remain undetected for a long time.
>  






[jira] [Created] (SPARK-32630) Reduce user confusion and subtle bugs by optionally preventing date & timestamp comparison

2020-08-16 Thread Simeon Simeonov (Jira)
Simeon Simeonov created SPARK-32630:
---

 Summary: Reduce user confusion and subtle bugs by optionally 
preventing date & timestamp comparison
 Key: SPARK-32630
 URL: https://issues.apache.org/jira/browse/SPARK-32630
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Simeon Simeonov


https://issues.apache.org/jira/browse/SPARK-23549 made Spark's handling of date 
vs. timestamp comparison consistent with SQL, which, unfortunately, isn't 
consistent with common sense.

When dates are compared with timestamps, they are promoted to timestamps at 
midnight of the date, in the server timezone, which is almost always UTC. This 
only works well if all timestamps in the data are logically time instants as 
opposed to dates + times, which only become instants with a known timezone.

The fundamental issue is that dates are a human time concept and instants are a 
machine time concept. While we can technically promote one to the other, 
logically, it only works 100% if midnight for all dates in the system is in the 
server timezone. 

Every major modern platform offers a clear distinction between machine time 
(instants) and human time (an instant with a timezone, UTC offset, etc.), 
because we have learned the hard way that date & time handling is a 
never-ending source of confusion and bugs. SQL, being an ancient language (40+ 
years old), is well behind software engineering best practices; using it as a 
guiding light is necessary for Spark to win market share, but unfortunate in 
every other way.

For example, Java has:
 * java.time.LocalDate
 * java.time.Instant
 * java.time.ZonedDateTime
 * java.time.OffsetDateTime

I am not suggesting we add new data types to Spark. I am suggesting we go to 
the heart of the matter, which is that most date vs. time handling issues are 
the result of confusion or carelessness.

What about introducing a new setting that makes comparisons between dates and 
timestamps illegal, preferably with a helpful exception message?

If it existed, I would certainly make it the default for all our clusters. The 
minor coding convenience that comes from being able to compare dates & 
timestamps with an automatic type promotion pales in comparison with the risk 
of subtle bugs that remain undetected for a long time.
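A sketch of the implicit promotion in question ({{spark}} is an existing 
session; the values are hypothetical):
{code:scala}
// The DATE operand is promoted to a timestamp at midnight of that date in the
// *session* time zone, so this comparison is effectively
//   TIMESTAMP'2020-08-16 23:30:00' > TIMESTAMP'2020-08-16 00:00:00'.
// When stored timestamps are true instants written under a different time
// zone, that implicit midnight-in-session-time-zone promotion is where subtle
// bugs creep in; the proposed setting would make such comparisons fail instead.
spark.sql("SELECT TIMESTAMP'2020-08-16 23:30:00' > DATE'2020-08-16' AS cmp").show()
// cmp = true
{code}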

 






[jira] [Comment Edited] (SPARK-30127) UDF should work for case class like Dataset operations

2019-12-11 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993750#comment-16993750
 ] 

Simeon Simeonov edited comment on SPARK-30127 at 12/11/19 6:01 PM:
---

The ability to transform one or more columns with native code, ignoring the 
rest of the schema, is sorely missed. Some may think that Dataset operations 
such as {{map}}/{{flatMap}} could be used to work around the need for this 
feature. That's true only in the cases where the Scala type of the full schema 
is (a) known in advance and (b) unchanging, which is impractical in many 
real-world use cases. Even in the cases where {{map}}/{{flatMap}} could work, 
there will be a performance cost to converting the entire row to/from internal 
row format, as opposed to just the columns that are needed.

However, UDFs are only one modality for exposing this capability and, given the 
Scala registration requirement for the UDFs, not necessarily the best one. If 
we add this capability for UDFs, I would suggest we also enhance the Dataset 
API with column-level {{map}}/{{flatMap}} functionality, e.g.,
{code:scala}
def flatMapColumn[C: Encoder, U: Encoder](colName: String, resultColName: 
String)
  (func: C => TraversableOnce[U]): Dataset[U]
{code}
While multiple columns can be passed in using {{functions.struct(col1, col2, 
...)}} and mapped to {{C}} that is {{TupleN}}, if that costs additional 
processing (internal buffer copying, serialization/deserialization), it would 
be trivial (and transparent to users if we rename {{flatMapColumn}} to 
{{flatMapColumns}} and {{colName}} to {{colName1}} above) to add versions for 2 
and 3 columns, which would cover 99+% of all uses:
{code:scala}
def flatMapColumns[C1, C2, U](colName1: String, colName2: String, 
resultColName: String)
  (func: (C1, C2) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2)], evU: Encoder[U]): Dataset[U]

def flatMapColumns[C1, C2, C3, U](colName1: String, colName2: String, colName3: 
String, resultColName: String)
  (func: (C1, C2, C3) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2, C3)], evU: Encoder[U]): Dataset[U]
{code}
[~cloud_fan] There are at least three benefits to adding this capability.
 # It provides a fundamental missing capability to the Dataset API: 
transforming data while knowing only part of the schema.
 # It makes use from Java more convenient, without the need for {{TypeTag}}, 
while making it consistent with {{map}}/{{flatMap}} behavior (via 
{{MapFunction}}/{{FlatMapFunction}}). Given Java's popularity, this is a big 
plus.
 # Unless I am mistaken, it may allow for more optimization than using UDFs.


was (Author: simeons):
The ability to transform one or more columns with native code, ignoring the 
rest of the schema, is sorely missed. Some may think that Dataset operations 
such as {{map}}/{{flatMap}} could be used to work around the need for this 
feature. That's true only in the cases where the Scala type of the full schema 
is (a) known in advance and (b) unchanging, which is impractical in many 
real-world use cases. Even in the cases where {{map}}/{{flatMap}} could work, 
there will be a performance cost to converting the entire row to/from internal 
row format, as opposed to just the columns that are needed.

However, UDFs are only one modality for exposing this capability and, given the 
Scala registration requirement for the UDFs, not necessarily the best one. If 
we add this capability for UDFs, I would suggest we also enhance the Dataset 
API with column-level {{map}}/{{flatMap}} functionality, e.g.,
{code:scala}
def flatMapColumn[C: Encoder, U: Encoder](colName: String)(func: C => 
TraversableOnce[U]): Dataset[U]
{code}
While multiple columns can be passed in using {{functions.struct(col1, col2, 
...)}} and mapped to {{C}} that is {{TupleN}}, if that costs additional 
processing (internal buffer copying, serialization/deserialization), it would 
be trivial (and transparent to users if we rename {{flatMapColumn}} to 
{{flatMapColumns}} and {{colName}} to {{colName1}} above) to add versions for 2 
and 3 columns, which would cover 99+% of all uses:
{code:scala}
def flatMapColumns[C1, C2, U](colName1: String, colName2: String)
  (func: (C1, C2) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2)], evU: Encoder[U]): Dataset[U]

def flatMapColumns[C1, C2, C3, U](colName1: String, colName2: String, colName3: 
String)
  (func: (C1, C2, C3) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2, C3)], evU: Encoder[U]): Dataset[U]
{code}
[~cloud_fan] There are at least three benefits to adding this capability.
 # It provides a fundamental missing capability to the Dataset API: 
transforming data while knowing only part of the schema.
 # It makes use from Java more convenient, without the need for {{TypeTag}}, 
while making it consistent with {{map}}/{{flatMap}} behavior (via 

[jira] [Comment Edited] (SPARK-30127) UDF should work for case class like Dataset operations

2019-12-11 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993750#comment-16993750
 ] 

Simeon Simeonov edited comment on SPARK-30127 at 12/11/19 5:42 PM:
---

The ability to transform one or more columns with native code, ignoring the 
rest of the schema, is sorely missed. Some may think that Dataset operations 
such as {{map}}/{{flatMap}} could be used to work around the need for this 
feature. That's true only in the cases where the Scala type of the full schema 
is (a) known in advance and (b) unchanging, which is impractical in many 
real-world use cases. Even in the cases where {{map}}/{{flatMap}} could work, 
there will be a performance cost to converting the entire row to/from internal 
row format, as opposed to just the columns that are needed.

However, UDFs are only one modality for exposing this capability and, given the 
Scala registration requirement for the UDFs, not necessarily the best one. If 
we add this capability for UDFs, I would suggest we also enhance the Dataset 
API with column-level {{map}}/{{flatMap}} functionality, e.g.,
{code:scala}
def flatMapColumn[C: Encoder, U: Encoder](colName: String)(func: C => 
TraversableOnce[U]): Dataset[U]
{code}
While multiple columns can be passed in using {{functions.struct(col1, col2, 
...)}} and mapped to {{C}} that is {{TupleN}}, if that costs additional 
processing (internal buffer copying, serialization/deserialization), it would 
be trivial (and transparent to users if we rename {{flatMapColumn}} to 
{{flatMapColumns}} and {{colName}} to {{colName1}} above) to add versions for 2 
and 3 columns, which would cover 99+% of all uses:
{code:scala}
def flatMapColumns[C1, C2, U](colName1: String, colName2: String)
  (func: (C1, C2) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2)], evU: Encoder[U]): Dataset[U]

def flatMapColumns[C1, C2, C3, U](colName1: String, colName2: String, colName3: 
String)
  (func: (C1, C2, C3) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2, C3)], evU: Encoder[U]): Dataset[U]
{code}
[~cloud_fan] There are at least three benefits to adding this capability.
 # It provides a fundamental missing capability to the Dataset API: 
transforming data while knowing only part of the schema.
 # It makes use from Java more convenient, without the need for {{TypeTag}}, 
while making it consistent with {{map}}/{{flatMap}} behavior (via 
{{MapFunction}}/{{FlatMapFunction}}). Given Java's popularity, this is a big 
plus.
 # Unless I am mistaken, it may allow for more optimization than using UDFs.


was (Author: simeons):
The ability to transform one or more columns with native code, ignoring the 
rest of the schema, is sorely missed. Some may think that Dataset operations 
such as {{map}}/{{flatMap}} could be used to work around the need for this 
feature. That's true only in the cases where the Scala type of the full schema 
is (a) known in advance and (b) unchanging, which is impractical in many 
real-world use cases. Even in the cases where {{map}}/{{flatMap}} could work, 
there will be a performance cost to converting the entire row to/from internal 
row format, as opposed to just the columns that are needed.

However, UDFs are only one modality for exposing this capability and, given the 
Scala registration requirement for the UDFs, not necessarily the best one. If 
we add this capability for UDFs, I would suggest we also enhance the Dataset 
API with column-level {{map}}/{{flatMap}} functionality, e.g.,
{code:scala}
def flatMapColumns[C: Encoder, U: Encoder](colName: String)(func: C => 
TraversableOnce[U]): Dataset[U]
{code}
While multiple columns can be passed in using {{functions.struct(col1, col2, 
...)}} and mapped to {{C}} that is {{TupleN}}, if that costs additional 
processing (internal buffer copying, serialization/deserialization), it would 
be trivial (and transparent to users if we rename {{colName}} above to 
{{colName1}}) to add versions for 2 and 3 columns, which would cover 99+% of 
all uses:
{code:scala}
def flatMapColumns[C1, C2, U](colName1: String, colName2: String)
  (func: (C1, C2) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2)], evU: Encoder[U]): Dataset[U]

def flatMapColumns[C1, C2, C3, U](colName1: String, colName2: String, colName3: 
String)
  (func: (C1, C2, C3) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2, C3)], evU: Encoder[U]): Dataset[U]
{code}
[~cloud_fan] There are at least three benefits to adding this capability.
 # It provides a fundamental missing capability to the Dataset API: 
transforming data while knowing only part of the schema.
 # It makes use from Java more convenient, without the need for {{TypeTag}}, 
while making it consistent with {{map}}/{{flatMap}} behavior (via 
{{MapFunction}}/{{FlatmapFunction}}). Given Java's popularity, this is a big 
plus.
 # Unless I am mistaken, it may allow for more optimization than 

[jira] [Commented] (SPARK-30127) UDF should work for case class like Dataset operations

2019-12-11 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993750#comment-16993750
 ] 

Simeon Simeonov commented on SPARK-30127:
-

The ability to transform one or more columns with native code, ignoring the 
rest of the schema, is sorely missed. Some may think that Dataset operations 
such as {{map}}/{{flatMap}} could be used to work around the need for this 
feature. That's true only in the cases where the Scala type of the full schema 
is (a) known in advance and (b) unchanging, which is impractical in many 
real-world use cases. Even in the cases where {{map}}/{{flatMap}} could work, 
there will be a performance cost to converting the entire row to/from internal 
row format, as opposed to just the columns that are needed.

However, UDFs are only one modality for exposing this capability and, given the 
Scala registration requirement for the UDFs, not necessarily the best one. If 
we add this capability for UDFs, I would suggest we also enhance the Dataset 
API with column-level {{map}}/{{flatMap}} functionality, e.g.,
{code:scala}
def flatMapColumns[C: Encoder, U: Encoder](colName: String)(func: C => 
TraversableOnce[U]): Dataset[U]
{code}
While multiple columns can be passed in using {{functions.struct(col1, col2, 
...)}} and mapped to {{C}} that is {{TupleN}}, if that costs additional 
processing (internal buffer copying, serialization/deserialization), it would 
be trivial (and transparent to users if we rename {{colName}} above to 
{{colName1}}) to add versions for 2 and 3 columns, which would cover 99+% of 
all uses:
{code:scala}
def flatMapColumns[C1, C2, U](colName1: String, colName2: String)
  (func: (C1, C2) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2)], evU: Encoder[U]): Dataset[U]

def flatMapColumns[C1, C2, C3, U](colName1: String, colName2: String, colName3: 
String)
  (func: (C1, C2, C3) => TraversableOnce[U])
  (implicit evC: Encoder[(C1, C2, C3)], evU: Encoder[U]): Dataset[U]
{code}
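A purely hypothetical usage sketch of the single-column variant above (neither 
{{flatMapColumns}} nor the {{Event}} type exists in Spark; this only 
illustrates the intended call site):
{code:scala}
case class Event(id: Long, tags: Seq[String], payload: String)

val events: Dataset[Event] = ???  // obtained elsewhere

// Explode just the "tags" column without knowing or touching the rest of the
// Event schema; the result is a Dataset of the extracted values.
val tags: Dataset[String] =
  events.flatMapColumns[Seq[String], String]("tags")(ts => ts)
{code}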
[~cloud_fan] There are at least three benefits to adding this capability.
 # It provides a fundamental missing capability to the Dataset API: 
transforming data while knowing only part of the schema.
 # It makes use from Java more convenient, without the need for {{TypeTag}}, 
while making it consistent with {{map}}/{{flatMap}} behavior (via 
{{MapFunction}}/{{FlatMapFunction}}). Given Java's popularity, this is a big 
plus.
 # Unless I am mistaken, it may allow for more optimization than using UDFs.

> UDF should work for case class like Dataset operations
> --
>
> Key: SPARK-30127
> URL: https://issues.apache.org/jira/browse/SPARK-30127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently, Spark UDFs can only work on data types like java.lang.String, 
> o.a.s.sql.Row, Seq[_], etc. This is inconvenient if you want to apply an 
> operation on one column and the column is of struct type: you must access data 
> from a Row object instead of your domain object, as in Dataset operations. It 
> would be great if UDFs could work on types that are supported by Dataset, e.g. 
> case classes.
> Note that there are multiple ways to register a UDF, and it's only possible 
> to support this feature if the UDF is registered using a Scala API that 
> provides a type tag, e.g. `def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, 
> RT])`






[jira] [Commented] (SPARK-26696) Dataset encoder should be publicly accessible

2019-01-22 Thread Simeon Simeonov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749518#comment-16749518
 ] 

Simeon Simeonov commented on SPARK-26696:
-

[PR with improvement|https://github.com/apache/spark/pull/23620]

> Dataset encoder should be publicly accessible
> -
>
> Key: SPARK-26696
> URL: https://issues.apache.org/jira/browse/SPARK-26696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: dataset, encoding
>
> As a platform, Spark should enable framework developers to accomplish outside 
> of the Spark codebase much of what can be accomplished inside the Spark 
> codebase. One of the obstacles to this is a historical pattern of excessive 
> data hiding in Spark, e.g., {{expr}} in {{Column}} not being accessible. This 
> issue is an example of this pattern when it comes to {{Dataset}}.
> Consider a transformation with the signature {{def foo[A](ds: Dataset[A]): 
> Dataset[A]}} whose implementation requires the use of {{toDF()}}. To get back 
> to {{Dataset[A]}} would require calling {{.as[A]}}, which requires an implicit 
> {{Encoder[A]}}. A naive approach would change the function signature to 
> {{foo[A : Encoder]}}, but this is poor API design that unnecessarily requires 
> carrying implicits from user code into framework code. We know 
> {{Encoder[A]}} exists because we have access to an instance of {{Dataset[A]}}... 
> but its {{encoder}} is not accessible.
> The solution is simple: make {{encoder}} a {{@transient val}} just as is the 
> case with {{queryExecution}}.






[jira] [Created] (SPARK-26696) Dataset encoder should be publicly accessible

2019-01-22 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-26696:
---

 Summary: Dataset encoder should be publicly accessible
 Key: SPARK-26696
 URL: https://issues.apache.org/jira/browse/SPARK-26696
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Simeon Simeonov


As a platform, Spark should enable framework developers to accomplish outside 
of the Spark codebase much of what can be accomplished inside the Spark 
codebase. One of the obstacles to this is a historical pattern of excessive 
data hiding in Spark, e.g., {{expr}} in {{Column}} not being accessible. This 
issue is an example of this pattern when it comes to {{Dataset}}.

Consider a transformation with the signature {{def foo[A](ds: Dataset[A]): 
Dataset[A]}} whose implementation requires the use of {{toDF()}}. To get back to 
{{Dataset[A]}} would require calling {{.as[A]}}, which requires an implicit 
{{Encoder[A]}}. A naive approach would change the function signature to {{foo[A 
: Encoder]}}, but this is poor API design that unnecessarily requires carrying 
implicits from user code into framework code. We know {{Encoder[A]}} exists 
because we have access to an instance of {{Dataset[A]}}... but its {{encoder}} is 
not accessible.

The solution is simple: make {{encoder}} a {{@transient val}} just as is the 
case with {{queryExecution}}.
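A hedged sketch of the resulting usage, assuming the requested change (i.e., 
{{encoder}} exposed as a public val on {{Dataset}}); {{withFoo}} is a 
hypothetical framework transformation:
{code:scala}
import org.apache.spark.sql.{Dataset, Encoder}

// Drop to DataFrame for untyped manipulation, then come back to Dataset[A]
// using the encoder the Dataset already carries, instead of threading an
// implicit Encoder[A] through every caller.
def withFoo[A](ds: Dataset[A]): Dataset[A] = {
  val df = ds.toDF()  // ... untyped manipulation happens here ...
  implicit val enc: Encoder[A] = ds.encoder  // assumes encoder is made public
  df.as[A]
}
{code}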






[jira] [Commented] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-17 Thread Simeon Simeonov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690751#comment-16690751
 ] 

Simeon Simeonov commented on SPARK-26084:
-

[~hvanhovell] done [https://github.com/apache/spark/pull/23075]

> AggregateExpression.references fails on unresolved expression trees
> ---
>
> Key: SPARK-26084
> URL: https://issues.apache.org/jira/browse/SPARK-26084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: aggregate, regression, sql
>
> [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
> stable ordering in {{AttributeSet.toSeq}} using expression IDs 
> ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
>  without noticing that {{AggregateExpression.references}} used 
> {{AttributeSet.toSeq}} as a shortcut 
> ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
>  The net result is that {{AggregateExpression.references}} fails for 
> unresolved aggregate functions.
> {code:scala}
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
>   org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
>   mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
>   isDistinct = false
> ).references
> {code}
> fails with
> {code:scala}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'y
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:220)
>   at java.util.Arrays.sort(Arrays.java:1438)
>   at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
>   at scala.collection.AbstractSeq.sorted(Seq.scala:41)
>   at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
>   at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
> {code}
> The solution is to avoid calling {{toSeq}} as ordering is not important in 
> {{references}} and simplify (and speed up) the implementation to something 
> like
> {code:scala}
> mode match {
>   case Partial | Complete => aggregateFunction.references
>   case PartialMerge | Final => 
> AttributeSet(aggregateFunction.aggBufferAttributes)
> }
> {code}






[jira] [Created] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-15 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-26084:
---

 Summary: AggregateExpression.references fails on unresolved 
expression trees
 Key: SPARK-26084
 URL: https://issues.apache.org/jira/browse/SPARK-26084
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Simeon Simeonov


[SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
stable ordering in {{AttributeSet.toSeq}} using expression IDs 
([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
 without noticing that {{AggregateExpression.references}} used 
{{AttributeSet.toSeq}} as a shortcut 
([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
 The net result is that {{AggregateExpression.references}} fails for unresolved 
aggregate functions.

{code:scala}
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
  org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
  mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
  isDistinct = false
).references
{code}

fails with

{code:scala}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
exprId on unresolved object, tree: 'y
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
at 
org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
at 
org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
at java.util.TimSort.sort(TimSort.java:220)
at java.util.Arrays.sort(Arrays.java:1438)
at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
at scala.collection.AbstractSeq.sorted(Seq.scala:41)
at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
{code}

The solution is to avoid calling {{toSeq}} as ordering is not important in 
{{references}} and simplify (and speed up) the implementation to something like

{code:scala}
mode match {
  case Partial | Complete => aggregateFunction.references
  case PartialMerge | Final => 
AttributeSet(aggregateFunction.aggBufferAttributes)
}
{code}






[jira] [Commented] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-15 Thread Simeon Simeonov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688760#comment-16688760
 ] 

Simeon Simeonov commented on SPARK-26084:
-

/cc [~maropu] [~hvanhovell] who worked on the PR that may have caused this 
problem

> AggregateExpression.references fails on unresolved expression trees
> ---
>
> Key: SPARK-26084
> URL: https://issues.apache.org/jira/browse/SPARK-26084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: aggregate, regression, sql
>
> [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
> stable ordering in {{AttributeSet.toSeq}} using expression IDs 
> ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
>  without noticing that {{AggregateExpression.references}} used 
> {{AttributeSet.toSeq}} as a shortcut 
> ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
>  The net result is that {{AggregateExpression.references}} fails for 
> unresolved aggregate functions.
> {code:scala}
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
>   org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
>   mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
>   isDistinct = false
> ).references
> {code}
> fails with
> {code:scala}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'y
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:220)
>   at java.util.Arrays.sort(Arrays.java:1438)
>   at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
>   at scala.collection.AbstractSeq.sorted(Seq.scala:41)
>   at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
>   at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
> {code}
> The solution is to avoid calling {{toSeq}} as ordering is not important in 
> {{references}} and simplify (and speed up) the implementation to something 
> like
> {code:scala}
> mode match {
>   case Partial | Complete => aggregateFunction.references
>   case PartialMerge | Final => 
> AttributeSet(aggregateFunction.aggBufferAttributes)
> }
> {code}






[jira] [Updated] (SPARK-25769) UnresolvedAttribute.sql() incorrectly escapes nested columns

2018-10-18 Thread Simeon Simeonov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-25769:

Description: 
{{UnresolvedAttribute.sql()}} output is incorrectly escaped for nested columns
{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The correct output is a.b, without backticks, or `a`.`b`.
$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
// res1: String = `a.b`

// Parsing is correct; the bug is localized to sql() 
$"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
// res2: Seq[String] = ArrayBuffer(a, b)
{code}
The likely culprit is that the {{sql()}} implementation does not check for 
{{nameParts}} being non-empty.
{code:java}
override def sql: String = name match { 
  case ParserUtils.escapedIdentifier(_) | 
ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
  case _ => quoteIdentifier(name) 
}
{code}
 

  was:
 This issue affects dynamic SQL generation that relies on {{sql()}}.
{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The correct output is a.b, without backticks, or `a`.`b`.
$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
// res2: String = `a.b`

// Parsing is correct; the bug is localized to sql() 
$"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
// res1: Seq[String] = ArrayBuffer(a, b)
{code}
The likely culprit is that the {{sql()}} implementation does not check for 
{{nameParts}} being non-empty.
{code:java}
override def sql: String = name match { 
  case ParserUtils.escapedIdentifier(_) | 
ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
  case _ => quoteIdentifier(name) 
}
{code}
 


> UnresolvedAttribute.sql() incorrectly escapes nested columns
> 
>
> Key: SPARK-25769
> URL: https://issues.apache.org/jira/browse/SPARK-25769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: sql
>
> {{UnresolvedAttribute.sql()}} output is incorrectly escaped for nested columns
> {code:java}
> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
> // The correct output is a.b, without backticks, or `a`.`b`.
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
> // res1: String = `a.b`
> // Parsing is correct; the bug is localized to sql() 
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
> // res2: Seq[String] = ArrayBuffer(a, b)
> {code}
> The likely culprit is that the {{sql()}} implementation does not check for 
> {{nameParts}} being non-empty.
> {code:java}
> override def sql: String = name match { 
>   case ParserUtils.escapedIdentifier(_) | 
> ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
>   case _ => quoteIdentifier(name) 
> }
> {code}
>  






[jira] [Updated] (SPARK-25769) UnresolvedAttribute.sql() incorrectly escapes nested columns

2018-10-18 Thread Simeon Simeonov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-25769:

Description: 
 This issue affects dynamic SQL generation that relies on {{sql()}}.
{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The correct output is a.b, without backticks, or `a`.`b`.
$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
// res2: String = `a.b`

// Parsing is correct; the bug is localized to sql() 
$"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
// res1: Seq[String] = ArrayBuffer(a, b)
{code}
The likely culprit is that the {{sql()}} implementation does not check for 
{{nameParts}} being non-empty.
{code:java}
override def sql: String = name match { 
  case ParserUtils.escapedIdentifier(_) | 
ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
  case _ => quoteIdentifier(name) 
}
{code}
 

  was:
 

This issue affects dynamic SQL generation that relies on {{sql()}}.

 
{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The correct output is a.b, without backticks, or `a`.`b`.
$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
// res2: String = `a.b`

// Parsing is correct; the bug is localized to sql() 
$"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
// res1: Seq[String] = ArrayBuffer(a, b)
{code}
 

The likely culprit is that the {{sql()}} implementation does not check for 
{{nameParts}} being non-empty.

 

 
{code:java}
override def sql: String = name match { 
  case ParserUtils.escapedIdentifier(_) | 
ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
  case _ => quoteIdentifier(name) 
}
{code}
 


> UnresolvedAttribute.sql() incorrectly escapes nested columns
> 
>
> Key: SPARK-25769
> URL: https://issues.apache.org/jira/browse/SPARK-25769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: sql
>
>  This issue affects dynamic SQL generation that relies on {{sql()}}.
> {code:java}
> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
> // The correct output is a.b, without backticks, or `a`.`b`.
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
> // res2: String = `a.b`
> // Parsing is correct; the bug is localized to sql() 
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
> // res1: Seq[String] = ArrayBuffer(a, b)
> {code}
> The likely culprit is that the {{sql()}} implementation does not check for 
> {{nameParts}} being non-empty.
> {code:java}
> override def sql: String = name match { 
>   case ParserUtils.escapedIdentifier(_) | 
> ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
>   case _ => quoteIdentifier(name) 
> }
> {code}
>  






[jira] [Created] (SPARK-25769) UnresolvedAttribute.sql() incorrectly escapes nested columns

2018-10-18 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-25769:
---

 Summary: UnresolvedAttribute.sql() incorrectly escapes nested 
columns
 Key: SPARK-25769
 URL: https://issues.apache.org/jira/browse/SPARK-25769
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: Simeon Simeonov


 

This issue affects dynamic SQL generation that relies on {{sql()}}.

 
{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The correct output is a.b, without backticks, or `a`.`b`.
$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
// res2: String = `a.b`

// Parsing is correct; the bug is localized to sql() 
$"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
// res1: Seq[String] = ArrayBuffer(a, b)
{code}
 

The likely culprit is that the {{sql()}} implementation does not check for 
{{nameParts}} being non-empty.

 

 
{code:java}
override def sql: String = name match { 
  case ParserUtils.escapedIdentifier(_) | 
ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
  case _ => quoteIdentifier(name) 
}
{code}
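For illustration only, one possible direction for a fix (a sketch, not an 
actual patch): quote each name part separately and join with a dot, so that 
a.b renders as `a`.`b`:
{code:scala}
// Sketch: quote each part of the attribute name individually instead of
// quoting the dotted name as a single identifier.
override def sql: String = nameParts.map(quoteIdentifier).mkString(".")
{code}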
 






[jira] [Updated] (SPARK-16483) Unifying struct fields and columns

2018-07-22 Thread Simeon Simeonov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-16483:

Description: 
This issue comes as a result of an exchange with Michael Armbrust outside of 
the usual JIRA/dev list channels.

DataFrame provides a full set of manipulation operations for top-level columns. 
They can be added, removed, modified, and renamed. The same is not true about 
fields inside structs; yet, from a logical standpoint, Spark users may very well 
want to perform the same operations on struct fields, especially since 
automatic schema discovery from JSON input tends to create deeply nested 
structs.

Common use-cases include:
 - Remove and/or rename struct field(s) to adjust the schema
 - Fix a data quality issue with a struct field (update/rewrite)

To do this with the existing API by hand requires manually calling 
{{named_struct}} and listing all fields, including ones we don't want to 
manipulate. This leads to complex, fragile code that cannot survive schema 
evolution.

It would be far better if the various APIs that can now manipulate top-level 
columns were extended to handle struct fields at arbitrary locations or, 
alternatively, if we introduced new APIs for modifying any field in a 
dataframe, whether it is a top-level one or one nested inside a struct.

Purely for discussion purposes (overloaded methods are not shown):
{code:java}
class Column(val expr: Expression) extends Logging {

  // ...

  // matches Dataset.schema semantics
  def schema: StructType

  // matches Dataset.select() semantics
  // '* support allows multiple new fields to be added easily, saving 
cumbersome repeated withColumn() calls
  def select(cols: Column*): Column

  // matches Dataset.withColumn() semantics of add or replace
  def withColumn(colName: String, col: Column): Column

  // matches Dataset.drop() semantics
  def drop(colName: String): Column

}

class Dataset[T] ... {

  // ...

  // Equivalent to sparkSession.createDataset(toDF.rdd, newSchema)
  def cast(newSchema: StructType): DataFrame

}
{code}
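Again purely for discussion, a hypothetical usage sketch of the proposed 
{{Column}} methods (none of them exist today; {{$}} and {{lpad}} come from 
{{spark.implicits._}} and {{org.apache.spark.sql.functions}}, and the column 
names are made up):
{code:scala}
// Rewrite one nested field and drop another without having to re-list every
// other field of the struct via named_struct.
val cleaned = df.withColumn(
  "address",
  $"address"
    .withColumn("zip", lpad($"address.zip", 5, "0"))  // fix a data quality issue
    .drop("legacy_code")                              // remove an obsolete field
)
{code}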
The benefit of the above API is that it unifies manipulating top-level & nested 
columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for 
nested field reordering, casting, etc., which is important in data exchange 
scenarios where field position matters. That's also the reason to add {{cast}} 
to {{Dataset}}: it improves consistency and readability (with method chaining). 
Another way to think of {{Dataset.cast}} is as the Spark schema equivalent of 
{{Dataset.as}}. {{as}} is to {{cast}} as a Scala encodable type is to a 
{{StructType}} instance.

  was:
This issue comes as a result of an exchange with Michael Armbrust outside of 
the usual JIRA/dev list channels.

DataFrame provides a full set of manipulation operations for top-level columns. 
They have be added, removed, modified and renamed. The same is not true about 
fields inside structs yet, from a logical standpoint, Spark users may very well 
want to perform the same operations on struct fields, especially since 
automatic schema discovery from JSON input tends to create deeply nested 
structs.

Common use-cases include:
 - Remove and/or rename struct field(s) to adjust the schema
 - Fix a data quality issue with a struct field (update/rewrite)

To do this with the existing API by hand requires manually calling 
{{named_struct}} and listing all fields, including ones we don't want to 
manipulate. This leads to complex, fragile code that cannot survive schema 
evolution.

It would be far better if the various APIs that can now manipulate top-level 
columns were extended to handle struct fields at arbitrary locations or, 
alternatively, if we introduced new APIs for modifying any field in a 
dataframe, whether it is a top-level one or one nested inside a struct.

Purely for discussion purposes (overloaded methods are not shown):
{code:java}
class Column(val expr: Expression) extends Logging {

  // ...

  // matches Dataset.schema semantics
  def schema: StructType

  // matches Dataset.select() semantics
  // '* support allows multiple new fields to be added easily, saving 
cumbersome repeated withColumn() calls
  def select(cols: Column*): Column

  // matches Dataset.withColumn() semantics of add or replace
  def withColumn(colName: String, col: Column): Column

  // matches Dataset.drop() semantics
  def drop(colName: String): Column

}

class Dataset[T] ... {

  // ...

  // Equivalent to sparkSession.createDataset(toDF.rdd, newSchema)
  def cast(newShema: StructType): DataFrame

}
{code}
The benefit of the above API is that it unifies manipulating top-level & nested 
columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for 
nested field reordering, casting, etc., which is important in data exchange 
scenarios where field position matters. That's also the reason to add {{cast}} 
to 

[jira] [Updated] (SPARK-16483) Unifying struct fields and columns

2018-07-22 Thread Simeon Simeonov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-16483:

Affects Version/s: 2.3.1
  Description: 
This issue comes as a result of an exchange with Michael Armbrust outside of 
the usual JIRA/dev list channels.

DataFrame provides a full set of manipulation operations for top-level columns. 
They can be added, removed, modified and renamed. The same is not true for 
fields inside structs, yet, from a logical standpoint, Spark users may very well 
want to perform the same operations on struct fields, especially since 
automatic schema discovery from JSON input tends to create deeply nested 
structs.

Common use-cases include:
 - Remove and/or rename struct field(s) to adjust the schema
 - Fix a data quality issue with a struct field (update/rewrite)

To do this with the existing API by hand requires manually calling 
{{named_struct}} and listing all fields, including ones we don't want to 
manipulate. This leads to complex, fragile code that cannot survive schema 
evolution.

It would be far better if the various APIs that can now manipulate top-level 
columns were extended to handle struct fields at arbitrary locations or, 
alternatively, if we introduced new APIs for modifying any field in a 
dataframe, whether it is a top-level one or one nested inside a struct.

Purely for discussion purposes (overloaded methods are not shown):
{code:java}
class Column(val expr: Expression) extends Logging {

  // ...

  // matches Dataset.schema semantics
  def schema: StructType

  // matches Dataset.select() semantics
  // '*' support allows multiple new fields to be added easily, saving cumbersome repeated withColumn() calls
  def select(cols: Column*): Column

  // matches Dataset.withColumn() semantics of add or replace
  def withColumn(colName: String, col: Column): Column

  // matches Dataset.drop() semantics
  def drop(colName: String): Column

}

class Dataset[T] ... {

  // ...

  // Equivalent to sparkSession.createDataset(toDF.rdd, newSchema)
  def cast(newSchema: StructType): DataFrame

}
{code}
The benefit of the above API is that it unifies manipulating top-level & nested 
columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for 
nested field reordering, casting, etc., which is important in data exchange 
scenarios where field position matters. That's also the reason to add {{cast}} 
to {{Dataset}}: it improves consistency and readability (with method chaining).

  was:
This issue comes as a result of an exchange with Michael Armbrust outside of 
the usual JIRA/dev list channels. 

DataFrame provides a full set of manipulation operations for top-level columns. 
They can be added, removed, modified and renamed. The same is not true for 
fields inside structs, yet, from a logical standpoint, Spark users may very well 
want to perform the same operations on struct fields, especially since 
automatic schema discovery from JSON input tends to create deeply nested 
structs.

Common use-cases include:

- Remove and/or rename struct field(s) to adjust the schema
- Fix a data quality issue with a struct field (update/rewrite)

To do this with the existing API by hand requires manually calling 
{{named_struct}} and listing all fields, including ones we don't want to 
manipulate. This leads to complex, fragile code that cannot survive schema 
evolution.

It would be far better if the various APIs that can now manipulate top-level 
columns were extended to handle struct fields at arbitrary locations or, 
alternatively, if we introduced new APIs for modifying any field in a 
dataframe, whether it is a top-level one or one nested inside a struct.

Purely for discussion purposes, here is the skeleton implementation of an 
update() implicit that we've use to modify any existing field in a dataframe. 
(Note that it depends on various other utilities and implicits that are not 
included). https://gist.github.com/ssimeonov/f98dcfa03cd067157fa08aaa688b0f66


> Unifying struct fields and columns
> --
>
> Key: SPARK-16483
> URL: https://issues.apache.org/jira/browse/SPARK-16483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: sql
>
> This issue comes as a result of an exchange with Michael Armbrust outside of 
> the usual JIRA/dev list channels.
> DataFrame provides a full set of manipulation operations for top-level 
> columns. They have be added, removed, modified and renamed. The same is not 
> true about fields inside structs yet, from a logical standpoint, Spark users 
> may very well want to perform the same operations on struct fields, 
> especially since automatic schema discovery from JSON input tends to create 
> 

[jira] [Commented] (SPARK-24269) Infer nullability rather than declaring all columns as nullable

2018-05-22 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484071#comment-16484071
 ] 

Simeon Simeonov commented on SPARK-24269:
-

There are many reasons why correct nullability inference is important for any 
data source, not just CSV & JSON. 
 # It can be used to verify the foundation of data contracts, especially in 
data exchange with third parties via something as simple as schema (StructType) 
equality. The common practice is to persist a JSON representation of the 
expected schema.
 # It can substantially improve performance and reduce memory use when dealing 
with Dataset[A <: Product] by using B <: AnyVal directly in case classes as 
opposed to via Option[B].
 # It can simplify the use of code-generation tools.

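As an example of (1), here is a minimal sketch; the persisted schema JSON and 
the input path are hypothetical:
{code:java}
import org.apache.spark.sql.types.{DataType, StructType}

// Compare freshly loaded data against a persisted JSON representation of the
// expected schema.
val expected = DataType.fromJson(expectedSchemaJson).asInstanceOf[StructType]
val actual = spark.read.parquet("/data/placements").schema

// Because nullable is currently always inferred as true, this equality check can
// fail even when the data actually satisfies the non-nullable contract.
require(actual == expected, s"schema drift:\n$actual\nvs\n$expected")
{code}
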
As an example of (2), consider the following:
{code:java}
import org.apache.spark.util.SizeEstimator
import scala.util.Random.nextInt

case class WithNulls(a: Option[Int], b: Option[Int])
case class WithoutNulls(a: Int, b: Int)

val sizeWith = SizeEstimator.estimate(WithNulls(Some(nextInt), Some(nextInt)))
// 88

val sizeWithout = SizeEstimator.estimate(WithoutNulls(nextInt, nextInt))
// 24

val percentMemoryReduction = 100.0 * (sizeWith - sizeWithout) / sizeWith
// 72.7{code}
I would argue that 70+% savings in memory use are a pretty big deal. The 
savings can be even bigger in the cases of many columns with small primitive 
types (Byte, Short, ...).

As an example of (3), consider tools that code-generate case classes from 
schema. We use tools like that at Swoop for efficient & performant 
transformations that cannot easily happen via the provided operations that work 
on internal rows. Without proper nullability inference, manual configuration 
has to be provided to these tools. We do this routinely, even for ad hoc data 
transformations in notebooks.

[~Teng Peng] I agree that this behavior should not be the default given Spark's 
current behavior. It should be activated via an option.

> Infer nullability rather than declaring all columns as nullable
> ---
>
> Key: SPARK-24269
> URL: https://issues.apache.org/jira/browse/SPARK-24269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, CSV and JSON datasource set the *nullable* flag to true 
> independently from data itself during schema inferring.
> JSON: 
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
> CSV: 
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51
> For example, source dataset has schema:
> {code}
> root
>  |-- item_id: integer (nullable = false)
>  |-- country: string (nullable = false)
>  |-- state: string (nullable = false)
> {code}
> If we save it and read again the schema of the inferred dataset is
> {code}
> root
>  |-- item_id: integer (nullable = true)
>  |-- country: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> The ticket aims to set the nullable flag more precisely during schema 
> inferring based on read data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2018-01-25 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340515#comment-16340515
 ] 

Simeon Simeonov commented on SPARK-4502:


+1 [~holdenk] this should be a big boost for any Spark user that is not working 
with flat data. In tests I did a while back, the performance difference between 
a nested and a flat schema was > 3x.
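
For context, the access pattern where this matters is selecting a single leaf 
out of a wide struct column; a minimal sketch (path and schema assumed):
{code}
// Only one leaf of the wide User struct is needed, but without nested-field
// pruning every field of the struct is read and assembled from Parquet.
val tweets = sqlContext.read.parquet("/data/tweets")
tweets.select("User.contributors_enabled").count()
{code}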

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assemble all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrades the performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987668#comment-15987668
 ] 

Simeon Simeonov commented on SPARK-18727:
-

[~xwu0226] The merged PR handles the use case of new top-level columns but, in 
the test cases, I did not see any examples of adding new fields to (nested) 
struct columns, a requirement for supporting schema evolution (and closing this 
ticket). Do you expect you'll work on that also?

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-02-23 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881490#comment-15881490
 ] 

Simeon Simeonov commented on SPARK-19716:
-

This is an important issue because it prevents, for datasets, the kind of schema 
evolution that {{mergeSchema=true}} enables for dataframes. This means two things:

1. Customers currently using dataframes with a non-trivial schema may not be able 
to migrate to datasets.
2. Customers that migrate to (or start with) datasets may be stuck, unable to 
evolve their schema.
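
A minimal reproduction sketch of the gap (assumed schema; the exact error 
varies by version):
{code}
import spark.implicits._

case class Data(a: Int, c: Int)
case class ComplexData(arr: Seq[Data])

val df = spark.sql("select array(named_struct('a', 1, 'b', 2, 'c', 3)) as arr")

df.select($"arr".getItem(0).getField("a")).show() // by-name access works on the dataframe side
df.as[ComplexData] // fails: struct elements inside the array are matched by position, not name
{code}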

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> if we have a DataFrame with schema {{a: int, b: int, c: int}}, and convert it 
> to Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
> extract the `a` and `c` columns to build the Data.
> However, if the struct is inside array, e.g. schema is {{arr: array>}}, and we wanna convert it to Dataset with {{case class 
> ComplexData(arr: Seq[Data])}}, we will fail. The reason is, to allow 
> compatible types, e.g. convert {{a: int}} to {{case class A(a: Long)}}, we 
> will add cast for each field, except struct type field, because struct type 
> is flexible, the number of columns can mismatch. We should probably also skip 
> cast for array and map type.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-01-10 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816538#comment-15816538
 ] 

Simeon Simeonov commented on SPARK-18727:
-

[~xwu0226] A common use case is adding a new field in a nested struct. In that 
case the top-level columns don't change, but the schema of at least one top-level 
struct column would change. A simple rule of thumb would be that you'd want to 
handle anything that would work well with {{.read.option("mergeSchema", 
"true")}}.

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16954) UDFs should allow output type to be specified in terms of the input type

2016-08-08 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-16954:
---

 Summary: UDFs should allow output type to be specified in terms of 
the input type
 Key: SPARK-16954
 URL: https://issues.apache.org/jira/browse/SPARK-16954
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Simeon Simeonov


Consider an {{array_compact}} UDF that removes {{null}} values from an array. 
There is no easy way to implement this UDF because an explicit return type with 
a {{TypeTag}} is required by code generation.

The interesting observation here is that the output type of {{array_compact}} is 
the same as its input type. In general, there is a broad class of UDFs, 
especially collection-oriented ones, whose output types are functions of the 
input types. In our Spark work we have found collection manipulation UDFs to be 
very powerful for cleaning up data and substantially improving performance, in 
particular, avoiding {{explode}} followed by {{groupBy}}. It would be nice if 
Spark made adding these types of UDFs very easy.

I won't go into possible ways to implement this under the covers as there are 
many options but I do want to point out that it is possible to communicate the 
right type information to Spark without changing the signature for UDF 
registration using placeholder types, e.g.,

{code}
import org.apache.spark.sql.Row
import scala.reflect.runtime.universe.TypeTag

sealed trait UDFArgumentAtPosition
case class ArgPos1() extends UDFArgumentAtPosition
case class ArgPos2() extends UDFArgumentAtPosition
// ...

case class Struct[ArgPos <: UDFArgumentAtPosition](value: Row)
case class ArrayElement[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
case class MapKey[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)
case class MapValue[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A)

// Functions are stubbed just to show that the signatures compile
def arrayCompact[A : TypeTag](xs: Seq[A]): ArgPos1 = null
def arraySum[A : Numeric : TypeTag](xs: Seq[A]): ArrayElement[ArgPos1, A] =
  ArrayElement[ArgPos1, A](implicitly[Numeric[A]].zero)
{code}
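
For comparison, the workaround available today (sketched with the Spark 2.x 
session API; the function name is made up) is to pin the element type per 
registration:
{code}
// One registration per element type, because the return type and its TypeTag
// must be fixed when the UDF is registered.
spark.udf.register("array_compact_string", (xs: Seq[String]) =>
  Option(xs).map(_.filter(_ != null)).orNull)

spark.sql("select array_compact_string(array('a', null, 'b')) as compacted").show()
{code}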





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16483) Unifying struct fields and columns

2016-07-11 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-16483:
---

 Summary: Unifying struct fields and columns
 Key: SPARK-16483
 URL: https://issues.apache.org/jira/browse/SPARK-16483
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Simeon Simeonov


This issue comes as a result of an exchange with Michael Armbrust outside of 
the usual JIRA/dev list channels. 

DataFrame provides a full set of manipulation operations for top-level columns. 
They can be added, removed, modified and renamed. The same is not true for 
fields inside structs, yet, from a logical standpoint, Spark users may very well 
want to perform the same operations on struct fields, especially since 
automatic schema discovery from JSON input tends to create deeply nested 
structs.

Common use-cases include:

- Remove and/or rename struct field(s) to adjust the schema
- Fix a data quality issue with a struct field (update/rewrite)

To do this with the existing API by hand requires manually calling 
{{named_struct}} and listing all fields, including ones we don't want to 
manipulate. This leads to complex, fragile code that cannot survive schema 
evolution.

It would be far better if the various APIs that can now manipulate top-level 
columns were extended to handle struct fields at arbitrary locations or, 
alternatively, if we introduced new APIs for modifying any field in a 
dataframe, whether it is a top-level one or one nested inside a struct.

Purely for discussion purposes, here is the skeleton implementation of an 
update() implicit that we've use to modify any existing field in a dataframe. 
(Note that it depends on various other utilities and implicits that are not 
included). https://gist.github.com/ssimeonov/f98dcfa03cd067157fa08aaa688b0f66
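
For reference, the kind of manual rebuilding the existing API requires looks 
roughly like this for a hypothetical {{df}} with a {{user}} struct column; every 
field of the struct has to be re-listed just to change one of them:
{code}
import org.apache.spark.sql.functions._

// Rewrite one field of a struct by re-creating the entire struct.
val fixed = df.withColumn("user", struct(
  col("user.id").as("id"),
  col("user.name").as("name"),
  upper(col("user.country")).as("country")  // the only field we actually wanted to change
))
{code}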



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16210) DataFrame.drop(colName) fails if another column has a period in its name

2016-06-25 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-16210:
---

 Summary: DataFrame.drop(colName) fails if another column has a 
period in its name
 Key: SPARK-16210
 URL: https://issues.apache.org/jira/browse/SPARK-16210
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
 Environment: Spark 1.6.1 on Databricks
Reporter: Simeon Simeonov


The following code fails with {{org.apache.spark.sql.AnalysisException: cannot 
resolve 'x.y' given input columns: [abc, x.y]}} because of the way {{drop()}} 
uses {{select()}} under the covers.

{code}
val rdd = sc.makeRDD("""{"x.y": 5, "abc": 10}""" :: Nil)
sqlContext.read.json(rdd).drop("abc")
{code}
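
A workaround is to select the surviving columns explicitly, quoting the dotted 
name in backticks so it is treated as a single top-level column (a sketch):
{code}
val df = sqlContext.read.json(rdd)
// Keep `x.y` and leave out abc instead of calling drop("abc").
df.select(df.col("`x.y`"))
{code}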



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-16 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335419#comment-15335419
 ] 

Simeon Simeonov commented on SPARK-14048:
-

I can confirm that this workaround works.

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schema such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}} always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-16 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334456#comment-15334456
 ] 

Simeon Simeonov edited comment on SPARK-14048 at 6/16/16 7:02 PM:
--

[~clockfly] The above code executes with no error on the same cluster where the 
example I shared fails. As I had speculated earlier, there must be something in 
the particular data structures we have that triggers the problem, which you can 
see in the attached notebook.


was (Author: simeons):
[~clockfly] The code executes with no error on the same cluster where the 
example I shared fails. As I had speculated earlier, there must be something in 
the particular data structures we have that triggers the problem, which you can 
see in the attached notebook.

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schema such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}} always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-16 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334456#comment-15334456
 ] 

Simeon Simeonov commented on SPARK-14048:
-

[~clockfly] The code executes with no error on the same cluster where the 
example I shared fails. As I had speculated earlier, there must be something in 
the particular data structures we have that triggers the problem, which you can 
see in the attached notebook.

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schema such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}} always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-15 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333076#comment-15333076
 ] 

Simeon Simeonov commented on SPARK-14048:
-

Yes, I get the exact same failure with 1.6.1. 

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schema such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}} always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-15 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333076#comment-15333076
 ] 

Simeon Simeonov edited comment on SPARK-14048 at 6/16/16 4:46 AM:
--

Yes, I get the exact same failure with 1.6.1 running on Databricks.


was (Author: simeons):
Yes, I get the exact same failure with 1.6.1. 

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schema such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}} always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-06-10 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325718#comment-15325718
 ] 

Simeon Simeonov commented on SPARK-13207:
-

[~yhuai] The PR associated with that ticket explicitly calls out {{_metadata}} 
and {{_common_metadata}} as not excluded. I am wondering why that PR will fix 
this issue... Can you add a test to demonstrate that this is fixed?

> _SUCCESS should not break partition discovery
> -
>
> Key: SPARK-13207
> URL: https://issues.apache.org/jira/browse/SPARK-13207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>  Labels: backport-needed
> Fix For: 1.6.2, 2.0.0
>
>
> Partitioning discovery will fail with the following case
> {code}
> test("_SUCCESS should not break partitioning discovery") {
> withTempPath { dir =>
>   val tablePath = new File(dir, "table")
>   val df = (1 to 3).map(i => (i, i, i, i)).toDF("a", "b", "c", "d")
>   df.write
> .format("parquet")
> .partitionBy("b", "c", "d")
> .save(tablePath.getCanonicalPath)
>   Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1", "_SUCCESS"))
>   Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1", 
> "_SUCCESS"))
>   Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1/d=1", 
> "_SUCCESS"))
>   
> checkAnswer(sqlContext.read.format("parquet").load(tablePath.getCanonicalPath),
>  df)
> }
>   }
> {code}
> Because {{_SUCCESS}} is the in the inner partitioning dirs, partitioning 
> discovery will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-05-28 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305703#comment-15305703
 ] 

Simeon Simeonov commented on SPARK-13207:
-

[~yhuai] I see the same problem with common metadata files. Do we need another 
JIRA issue for those? 

For example, the following S3 directory structure:

{code}
2016-05-28 20:08:18  41207 
ss/tests/partitioning/placements/par_ts=201605260400/_common_metadata
2016-05-28 20:08:17    3981760 
ss/tests/partitioning/placements/par_ts=201605260400/_metadata
2016-05-28 20:06:10   26149863 
ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=engagement/part-r-1-85d6c96a-4384-4e43-a575-ac71e147f349.gz.parquet
2016-05-28 20:06:42   32882968 
ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=error/part-r-2-85d6c96a-4384-4e43-a575-ac71e147f349.gz.parquet
2016-05-28 20:05:49    1700553 
ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=late/part-r-0-85d6c96a-4384-4e43-a575-ac71e147f349.gz.parquet
2016-05-28 20:04:21  41207 
ss/tests/partitioning/placements/par_ts=201605270400/_common_metadata
2016-05-28 20:04:20    4120453 
ss/tests/partitioning/placements/par_ts=201605270400/_metadata
2016-05-28 20:02:37   21471845 
ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=engagement/part-r-00028-ad451b48-e48c-46eb-9e5f-c1bc4c663a4c.gz.parquet
2016-05-28 20:03:12   29981797 
ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=error/part-r-00029-ad451b48-e48c-46eb-9e5f-c1bc4c663a4c.gz.parquet
2016-05-28 20:02:29    1525027 
ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=late/part-r-00025-ad451b48-e48c-46eb-9e5f-c1bc4c663a4c.gz.parquet
{code}

generates the following partition discovery exception when loading 
{{/ss/tests/partitioning/placements}}:

{code}
java.lang.AssertionError: assertion failed: Conflicting partition column names 
detected:

Partition column name list #0: par_ts, par_job, par_cat
Partition column name list #1: par_ts

For partitioned table directories, data files should only live in leaf 
directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent 
partition column names:


dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400

dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400

dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=late

dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=late

dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=error

dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=error

dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605270400/par_job=load/par_cat=engagement

dbfs:/mnt/swoop-spark-play/ss/tests/partitioning/placements/par_ts=201605260400/par_job=load/par_cat=engagement
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:246)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:115)
at 
org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:621)
at 
org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:526)
at 
org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:525)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.sources.HadoopFsRelation.partitionSpec(interfaces.scala:524)
at 
org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionColumns$1.apply(interfaces.scala:578)
at 
org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionColumns$1.apply(interfaces.scala:578)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.sources.HadoopFsRelation.partitionColumns(interfaces.scala:578)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:637)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
at 
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
{code}

> _SUCCESS should not break partition discovery

[jira] [Comment Edited] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-19 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248627#comment-15248627
 ] 

Simeon Simeonov edited comment on SPARK-10574 at 4/19/16 9:05 PM:
--

[~josephkb] I agree that it would be an improvement. The issue I see with the 
current patch is that it would be an incompatible API change in the future 
(specifying hashing functions as objects and not by name). If we make just this 
one change everything else can be handled with no API changes, e.g., seeds are 
just constructor parameters or closure variables available to the hashing 
function and collision detection is just decoration. 

That's my practical argument related to MLlib. 

Beyond that, there are multiple arguments related to the usability, testability 
and maintainability of the Spark codebase, which has high code change velocity 
from a large number of committers, which contributes to a high issue rate. The 
simplest way to battle this is one design decision at a time. The PR hard-codes 
what is essentially a strategy pattern in the implementation of an object. It 
conflates responsibilities. It introduces branching, which makes testing and 
documentation more complicated. If hashing functions are externalized, they 
could be trivially tested. If {{HashingTF}} took a {{Function1[Any, Int]}} as 
input it could also be tested much more simply with any function. The behavior 
and the APIs become simpler to document because they do one thing. Etc. 

Perhaps I'm only seeing the benefits of externalizing the hashing strategy and 
missing the complexity in what I propose? We have ample examples of Spark APIs 
using functions as inputs so there are standard ways to handle this across 
languages. We don't need a custom trait if we stick to {{Any}} as the hashing 
function input. What else could be a problem?


was (Author: simeons):
[~josephkb] I agree that it would be an improvement. The issue I see with the 
current patch is that it would be an incompatible API change in the future 
(specifying hashing functions as objects and not by name). If we make just this 
one change everything else can be handled with no API changes, e.g., seeds are 
just constructor parameters or closure variables available to the hashing 
function and collision detection is just decoration. 

That's my practical argument related to MLlib. 

Beyond that, there are multiple arguments related to the usability, testability 
and maintainability of the Spark codebase, which has high code change velocity 
from a large number of committers, which contributes to a high issue rate. The 
simplest way to battle this is one design decision at a time. The PR hard-codes 
what is essentially a strategy pattern in the implementation of an object. It 
conflates responsibilities. It introduces branching, which makes testing and 
documentation more complicated. If hashing functions are externalized, they 
could be trivially tested. If {{HashingTF}} took a {{Function1[Any, Int]}} as 
input it could also be tested much more simply with any function. The 
documentation and the APIs become simpler to document because they do one 
thing. Etc. 

Perhaps I'm only seeing the benefits of externalizing the hashing strategy and 
missing the complexity in what I propose? We have ample examples of Spark APIs 
using functions as inputs so there are standard ways to handle this across 
languages. We don't need a custom trait if we stick to {{Any}} as the hashing 
function input. What else could be a problem?

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> 

[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-19 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248627#comment-15248627
 ] 

Simeon Simeonov commented on SPARK-10574:
-

[~josephkb] I agree that it would be an improvement. The issue I see with the 
current patch is that it would be an incompatible API change in the future 
(specifying hashing functions as objects and not by name). If we make just this 
one change everything else can be handled with no API changes, e.g., seeds are 
just constructor parameters or closure variables available to the hashing 
function and collision detection is just decoration. 

That's my practical argument related to MLlib. 

Beyond that, there are multiple arguments related to the usability, testability 
and maintainability of the Spark codebase, which has high code change velocity 
from a large number of committers, which contributes to a high issue rate. The 
simplest way to battle this is one design decision at a time. The PR hard-codes 
what is essentially a strategy pattern in the implementation of an object. It 
conflates responsibilities. It introduces branching, which makes testing and 
documentation more complicated. If hashing functions are externalized, they 
could be trivially tested. If {{HashingTF}} took a {{Function1[Any, Int]}} as 
input it could also be tested much more simply with any function. The 
documentation and the APIs become simpler to document because they do one 
thing. Etc. 

Perhaps I'm only seeing the benefits of externalizing the hashing strategy and 
missing the complexity in what I propose? We have ample examples of Spark APIs 
using functions as inputs so there are standard ways to handle this across 
languages. We don't need a custom trait if we stick to {{Any}} as the hashing 
function input. What else could be a problem?
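
To make the idea concrete, a purely illustrative sketch of an externalized 
hashing strategy (this is not Spark's actual {{HashingTF}} API):
{code}
import scala.util.hashing.MurmurHash3

// The transformer takes the hashing strategy as a plain function; seeding and
// collision tracking become concerns of the supplied function, not of the transformer.
class HashingTFSketch(numFeatures: Int, hash: Any => Int) {
  def indexOf(term: Any): Int = {
    val h = hash(term) % numFeatures
    if (h < 0) h + numFeatures else h
  }
}

// Usage with a seeded MurmurHash3 over the term's string form.
val tf = new HashingTFSketch(1 << 18, term => MurmurHash3.stringHash(term.toString, 42))
tf.indexOf("spark")
{code}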

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14443) parse_url() does not escape query parameters

2016-04-06 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-14443:
---

 Summary: parse_url() does not escape query parameters
 Key: SPARK-14443
 URL: https://issues.apache.org/jira/browse/SPARK-14443
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
 Environment: Databricks
Reporter: Simeon Simeonov


To reproduce, run the following SparkSQL statement:

{code}
select 
parse_url('http://1168.xg4ken.com/media/redir.php?prof=457=67116=kw54_inner_url_encoded=1=adwords=Desktop[]=http%3A%2F%2Fwww.landroverusa.com%2Fvehicles%2Frange-rover-sport-off-road-suv%2Findex.html%3Futm_content%3Dcontent%26utm_source%fb%26utm_medium%3Dcpc%26utm_term%3DAdwords_Brand_Range_Rover_Sport%26utm_campaign%3DFB_Land_Rover_Brand',
 'QUERY', 'url[]')
{code}

The exception is ultimately caused by

{code}
java.util.regex.PatternSyntaxException: Unclosed character class near index 17
(&|^)url[]=([^&]*)
 ^
{code}

It looks like the code builds a regex internally without escaping the passed-in 
query parameter name.
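
For illustration, a sketch of the kind of fix this implies, quoting the key 
before it is spliced into the pattern (the helper name is made up, not the 
actual Spark code):

{code}
import java.util.regex.Pattern

// Illustrative helper: quote the parameter name so '[' and ']' are treated literally.
def queryParamPattern(key: String): Pattern =
  Pattern.compile("(&|^)" + Pattern.quote(key) + "=([^&]*)")

val query = "prof=457&url[]=http%3A%2F%2Fwww.landroverusa.com"
val m = queryParamPattern("url[]").matcher(query)
val value = if (m.find()) m.group(2) else null
// value == "http%3A%2F%2Fwww.landroverusa.com"
{code}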



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-03-24 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210213#comment-15210213
 ] 

Simeon Simeonov edited comment on SPARK-14048 at 3/24/16 1:31 PM:
--

You can see the problem on a 1.6.0 cluster in the Databricks environment in the 
attached file ({{bug_structs_with_backticks.html}}). Perhaps the issue is not 
as simple as I assumed it to be and may require more complex types?


was (Author: simeons):
You can see the problem on a 1.6.0 cluster in the Databricks environment in the 
attached file. Perhaps the issue is not as simple as I assumed it to be and may 
require more complex types?

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schemas such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}}, always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-03-24 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-14048:

Attachment: bug_structs_with_backticks.html

You can see the problem on a 1.6.0 cluster in the Databricks environment in the 
attached file. Perhaps the issue is not as simple as I assumed it to be and may 
require more complex types?

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schemas such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}}, always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-03-21 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-14048:
---

 Summary: Aggregation operations on structs fail when the structs 
have fields with special characters
 Key: SPARK-14048
 URL: https://issues.apache.org/jira/browse/SPARK-14048
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
 Environment: Databricks w/ 1.6.0
Reporter: Simeon Simeonov


Consider a schema where a struct has field names with special characters, e.g.,

{code}
 |-- st: struct (nullable = true)
 ||-- x.y: long (nullable = true)
{code}

Schemas such as these are frequently generated by the JSON schema generator, 
which seems to never want to map JSON data to {{MapType}}, always preferring to 
use {{StructType}}. 

In SparkSQL, referring to these fields requires backticks, e.g., {{st.`x.y`}}. 
There is no problem manipulating these structs unless one is using an 
aggregation function. It seems that, under the covers, the code is not escaping 
fields with special characters correctly.

For example, 

{code}
select first(st) as st from tbl group by something
{code}

generates

{code}
org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
struct. If you have a struct and a field name of it has any special 
characters, please use backticks (`) to quote that field name, e.g. `x+y`. 
Please note that backtick itself is not supported in a field name.
  at 
org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
  at 
org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
  at 
org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
  at 
org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
  at 
com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
  at 
com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
  at 
com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
  at 
com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
  at 
com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
  at scala.util.Try$.apply(Try.scala:161)
  at 
com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
  at 
com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
  at 
com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
  at java.lang.Thread.run(Thread.java:745)
{code}
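
For reference, a minimal {{spark-shell}} sketch that builds such a schema and 
runs the failing aggregation (the table name and JSON literals are 
illustrative):

{code}
// Infer a struct column whose field name contains a dot, then aggregate it.
val df = sqlContext.read.json(sc.parallelize(Seq(
  """{"something": 1, "st": {"x.y": 10}}""",
  """{"something": 1, "st": {"x.y": 20}}"""
)))
df.printSchema()                      // st: struct with a field named x.y
df.registerTempTable("tbl")
sqlContext.sql("select first(st) as st from tbl group by something").show()
{code}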



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13445) Selecting "data" with window function does not work unless aliased (using PARTITION BY)

2016-02-25 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168527#comment-15168527
 ] 

Simeon Simeonov commented on SPARK-13445:
-

I can reproduce it consistently in our 1.6.0 Databricks environment. See 
http://www.screencast.com/t/iDVYagvqA

> Seleting "data" with window function does not work unless aliased (using 
> PARTITION BY)
> --
>
> Key: SPARK-13445
> URL: https://issues.apache.org/jira/browse/SPARK-13445
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> The code does not throw an exception if "data" is aliased.  Maybe this is a 
> reserved word or aliases are just required when using PARTITION BY?
> {code}
> sql("""
>   SELECT 
> data as the_data,
> row_number() over (partition BY data.type) AS foo
>   FROM event_record_sample
> """)
> {code}
> However, this code throws an error:
> {code}
> sql("""
>   SELECT 
> data,
> row_number() over (partition BY data.type) AS foo
>   FROM event_record_sample
> """)
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) type#15246 
> missing from 
> data#15107,par_cat#15112,schemaMajorVersion#15110,source#15108,recordId#15103,features#15106,eventType#15105,ts#15104L,schemaMinorVersion#15111,issues#15109
>  in operator !Project [data#15107,type#15246];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:104)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115558#comment-15115558
 ] 

Simeon Simeonov commented on SPARK-12890:
-

[~viirya] If schema merging is the cause of the problem then this is clearly a 
bug. The resulting schema for a query using only partition columns is 
completely independent of the schema in the data files. There is no merging to 
do at all.
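
As a quick check of that hypothesis, schema merging can be disabled explicitly 
when loading the table (paths here are placeholders):

{code}
// Load the partitioned table with Parquet schema merging explicitly off,
// then run the partition-column-only query.
val events = sqlContext.read
  .option("mergeSchema", "false")
  .parquet("/data/events")            // placeholder path, partitioned by date=...
events.registerTempTable("events")
sqlContext.sql("select max(date) from events").show()
{code}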

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-24 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114829#comment-15114829
 ] 

Simeon Simeonov commented on SPARK-12890:
-

Thanks for the clarification, [~hyukjin.kwon]. Still, there is no reason why it 
should be looking at the files at all. This is especially a problem when the 
Parquet files are in an object store such as S3, because there is no such thing 
as reading the footer of an S3 object. 

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-18 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106124#comment-15106124
 ] 

Simeon Simeonov edited comment on SPARK-12890 at 1/19/16 2:07 AM:
--

I've experienced this issue with a multi-level partitioned table loaded via 
{{sqlContext.read.parquet()}}. I'm not sure Spark is actually reading any data 
from the Parquet files but it does look at every Parquet file (perhaps reading 
meta-data?). I discovered this by accident because I had invalid Parquet files 
in the table tree left over from a failed job. Spark errored, which surprised 
me as I would have expected it to not look at any of the data when the query 
could be satisfied entirely through the partition columns. 

This is an important issue because it affects query speed for very large 
partitioned tables.


was (Author: simeons):
I've experienced this issue with a multi-level partitioned table loaded via 
`sqlContext.read.parquet()`. I'm not sure Spark is actually reading any data 
from the Parquet files but it does look at every Parquet file (perhaps reading 
meta-data?). I discovered this by accident because I had invalid Parquet files 
in the table tree left over from a failed job. Spark errored, which surprised 
me as I would have expected it to not look at any of the data when the query 
could be satisfied entirely through the partition columns. 

This is an important issue because it affects query speed for very large 
partitioned tables.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-18 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106124#comment-15106124
 ] 

Simeon Simeonov commented on SPARK-12890:
-

I've experienced this issue with a multi-level partitioned table loaded via 
`sqlContext.read.parquet()`. I'm not sure Spark is actually reading any data 
from the Parquet files but it does look at every Parquet file (perhaps reading 
meta-data?). I discovered this by accident because I had invalid Parquet files 
in the table tree left over from a failed job. Spark errored, which surprised 
me as I would have expected it to not look at any of the data when the query 
could be satisfied entirely through the partition columns. 

This is an important issue because it affects query speed for very large 
partitioned tables.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7301) Issue with duplicated fields in interpreted json schemas

2015-12-18 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065021#comment-15065021
 ] 

Simeon Simeonov commented on SPARK-7301:


It would be nice if using backticks to refer to the column resolved the 
ambiguity, e.g., {{select `A` as upperA, `a` as lowerA from test}}.

> Issue with duplicated fields in interpreted json schemas
> 
>
> Key: SPARK-7301
> URL: https://issues.apache.org/jira/browse/SPARK-7301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: David Crossland
>
> I have a large JSON dataset that has evolved over time; as such, some fields 
> seem to have slight renames or have been capitalised in some way. This means 
> there are certain fields that Spark considers ambiguous, and when I attempt to 
> access them I get an 
> org.apache.spark.sql.AnalysisException: Ambiguous reference to fields 
> StructField(Currency,StringType,true), StructField(currency,StringType,true);
> error.
> There appears to be no way to resolve an ambiguous field after it has been 
> inferred by Spark SQL other than to manually construct the schema using 
> StructType/StructField, which is a bit heavy-handed as the schema is quite 
> large. Is there some way to resolve an ambiguous reference, or to affect the 
> schema post-inference? It seems like something of a bug that I can't tell 
> Spark to treat both fields as though they were the same. I've created a test 
> where I manually defined a schema as 
> val schema = StructType(Seq(StructField("A", StringType, true)))
> and it returns 2 rows when I perform a count on the following dataset:
> {"A":"test1"}
> {"a":"test2"}
> If I could modify the schema to remove the duplicate entries, then I could 
> work around this issue.
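
One way to apply the workaround described above, sketched with a placeholder 
path, is to bypass inference and supply the schema when reading:

{code}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Supply an explicit schema instead of relying on inference; the path is a placeholder.
val schema = StructType(Seq(StructField("A", StringType, true)))
val df = sqlContext.read.schema(schema).json("/data/duplicated_fields.json")
df.count()   // per the test described above, returns 2 for the two-line dataset
{code}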



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12022) spark-shell cannot run on master created with spark-ec2

2015-11-26 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-12022:
---

 Summary: spark-shell cannot run on master created with spark-ec2
 Key: SPARK-12022
 URL: https://issues.apache.org/jira/browse/SPARK-12022
 Project: Spark
  Issue Type: Bug
  Components: EC2, Spark Shell, SQL
Affects Versions: 1.5.1
 Environment: AWS EC2
Reporter: Simeon Simeonov


Running {{./bin/spark-shell}} on the master node created with {{spark-ec2}} 
results in:

{code}
java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
/tmp/hive on HDFS should be writable. Current permissions are: rwx--x--x
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at 
org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
at 
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
at 
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167)
{code}
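
The message points at the permissions of {{/tmp/hive}} on HDFS. A sketch of one 
possible fix, assuming {{sc}} is still usable in the shell; alternatively, 
{{hadoop fs -chmod 733 /tmp/hive}} can be run on the master:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

// Make the Hive scratch directory writable (733 = rwx-wx-wx).
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.setPermission(new Path("/tmp/hive"), new FsPermission(Integer.parseInt("733", 8).toShort))
{code}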



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11770) Spark SQL field resolution error in GROUP BY HAVING clause

2015-11-24 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov closed SPARK-11770.
---
Resolution: Cannot Reproduce

[~smilegator] I cannot reproduce the problem under v1.5.2. Closing.

> Spark SQL field resolution error in GROUP BY HAVING clause
> --
>
> Key: SPARK-11770
> URL: https://issues.apache.org/jira/browse/SPARK-11770
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Simeon Simeonov
>  Labels: SQL
>
> A query fails to resolve columns from the source data when an alias is added 
> to the SELECT clause. I have not been able to isolate a reproducible 
> standalone test. Below are a series of {{spark-shell}} operations that show 
> the problem step-by-step. Spark SQL execution happens via {{HiveContext}}.
> I believe the root cause of the problem is that when (and only when) there 
> are aliased expression columns in the SELECT clause, Spark SQL "sees" columns 
> from the SELECT clause while evaluating a HAVING clause. According to the SQL 
> standard that should not happen.
> The table in question is simple:
> {code}
> scala> ctx.table("hevents_test").printSchema
> 15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:19:19 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> root
>  |-- vertical: string (nullable = true)
>  |-- did: string (nullable = true)
>  |-- surl: string (nullable = true)
>  |-- creative_id: long (nullable = true)
>  |-- keyword_text: string (nullable = true)
>  |-- errors: integer (nullable = true)
>  |-- views: integer (nullable = true)
>  |-- clicks: integer (nullable = true)
>  |-- actions: long (nullable = true)
> {code}
> A basic aggregation with a SELECT expression works without a problem:
> {code}
> scala> ctx.sql("""
>  |   select 1.0*creative_id, sum(views) as views
>  |   from hevents_test
>  |   group by creative_id
>  |   having sum(views) > 500
>  | """)
> 15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, 
> sum(views) as views
>   from hevents_test
>   group by creative_id
>   having sum(views) > 500
> 15/11/16 22:25:53 INFO ParseDriver: Parse Completed
> 15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:25:53 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
> {code}
> However, if the expression is aliased, the analyzer gets confused about 
> {{views}}.
> {code}
> scala> ctx.sql("""
>  | select 1.0*creative_id as cid, sum(views) as views
>  | from hevents_test
>  | group by creative_id
>  | having sum(views) > 500
>  | """)
> 15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id 
> as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:26:59 INFO ParseDriver: Parse Completed
> 15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:26:59 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L 
> missing from 
> vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5
>  in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as 
> bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as 
> double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 

[jira] [Comment Edited] (SPARK-11770) Spark SQL field resolution error in GROUP BY HAVING clause

2015-11-24 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026065#comment-15026065
 ] 

Simeon Simeonov edited comment on SPARK-11770 at 11/25/15 2:53 AM:
---

[~smilegator] Your code runs fine in 1.5.2. My code generates the exact same 
error. 

What's the easiest way for me to share a sliver of data with you?


was (Author: simeons):
[~smilegator] Actually, your code runs fine in 1.5.2. My code generates the 
exact same error. 

What's the easiest way for me to share a sliver of data with you?

> Spark SQL field resolution error in GROUP BY HAVING clause
> --
>
> Key: SPARK-11770
> URL: https://issues.apache.org/jira/browse/SPARK-11770
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Simeon Simeonov
>  Labels: SQL
>
> A query fails to resolve columns from the source data when an alias is added 
> to the SELECT clause. I have not been able to isolate a reproducible 
> standalone test. Below are a series of {{spark-shell}} operations that show 
> the problem step-by-step. Spark SQL execution happens via {{HiveContext}}.
> I believe the root cause of the problem is that when (and only when) there 
> are aliased expression columns in the SELECT clause, Spark SQL "sees" columns 
> from the SELECT clause while evaluating a HAVING clause. According to the SQL 
> standard that should not happen.
> The table in question is simple:
> {code}
> scala> ctx.table("hevents_test").printSchema
> 15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:19:19 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> root
>  |-- vertical: string (nullable = true)
>  |-- did: string (nullable = true)
>  |-- surl: string (nullable = true)
>  |-- creative_id: long (nullable = true)
>  |-- keyword_text: string (nullable = true)
>  |-- errors: integer (nullable = true)
>  |-- views: integer (nullable = true)
>  |-- clicks: integer (nullable = true)
>  |-- actions: long (nullable = true)
> {code}
> A basic aggregation with a SELECT expression works without a problem:
> {code}
> scala> ctx.sql("""
>  |   select 1.0*creative_id, sum(views) as views
>  |   from hevents_test
>  |   group by creative_id
>  |   having sum(views) > 500
>  | """)
> 15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, 
> sum(views) as views
>   from hevents_test
>   group by creative_id
>   having sum(views) > 500
> 15/11/16 22:25:53 INFO ParseDriver: Parse Completed
> 15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:25:53 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
> {code}
> However, if the expression is aliased, the analyzer gets confused about 
> {{views}}.
> {code}
> scala> ctx.sql("""
>  | select 1.0*creative_id as cid, sum(views) as views
>  | from hevents_test
>  | group by creative_id
>  | having sum(views) > 500
>  | """)
> 15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id 
> as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:26:59 INFO ParseDriver: Parse Completed
> 15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:26:59 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L 
> missing from 
> vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5
>  in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as 
> bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as 
> double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 

[jira] [Reopened] (SPARK-11770) Spark SQL field resolution error in GROUP BY HAVING clause

2015-11-24 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov reopened SPARK-11770:
-

[~smilegator] Actually, your code runs fine in 1.5.2. My code generates the 
exact same error. 

What's the easiest way for me to share a sliver of data with you?

> Spark SQL field resolution error in GROUP BY HAVING clause
> --
>
> Key: SPARK-11770
> URL: https://issues.apache.org/jira/browse/SPARK-11770
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Simeon Simeonov
>  Labels: SQL
>
> A query fails to resolve columns from the source data when an alias is added 
> to the SELECT clause. I have not been able to isolate a reproducible 
> standalone test. Below are a series of {{spark-shell}} operations that show 
> the problem step-by-step. Spark SQL execution happens via {{HiveContext}}.
> I believe the root cause of the problem is that when (and only when) there 
> are aliased expression columns in the SELECT clause, Spark SQL "sees" columns 
> from the SELECT clause while evaluating a HAVING clause. According to the SQL 
> standard that should not happen.
> The table in question is simple:
> {code}
> scala> ctx.table("hevents_test").printSchema
> 15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:19:19 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> root
>  |-- vertical: string (nullable = true)
>  |-- did: string (nullable = true)
>  |-- surl: string (nullable = true)
>  |-- creative_id: long (nullable = true)
>  |-- keyword_text: string (nullable = true)
>  |-- errors: integer (nullable = true)
>  |-- views: integer (nullable = true)
>  |-- clicks: integer (nullable = true)
>  |-- actions: long (nullable = true)
> {code}
> A basic aggregation with a SELECT expression works without a problem:
> {code}
> scala> ctx.sql("""
>  |   select 1.0*creative_id, sum(views) as views
>  |   from hevents_test
>  |   group by creative_id
>  |   having sum(views) > 500
>  | """)
> 15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, 
> sum(views) as views
>   from hevents_test
>   group by creative_id
>   having sum(views) > 500
> 15/11/16 22:25:53 INFO ParseDriver: Parse Completed
> 15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:25:53 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
> {code}
> However, if the expression is aliased, the analyzer gets confused about 
> {{views}}.
> {code}
> scala> ctx.sql("""
>  | select 1.0*creative_id as cid, sum(views) as views
>  | from hevents_test
>  | group by creative_id
>  | having sum(views) > 500
>  | """)
> 15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id 
> as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:26:59 INFO ParseDriver: Parse Completed
> 15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:26:59 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L 
> missing from 
> vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5
>  in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as 
> bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as 
> double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-11770) Spark SQL field resolution error in GROUP BY HAVING clause

2015-11-24 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-11770:

Comment: was deleted

(was: [~smilegator] I cannot reproduce the problem under v1.5.2. Closing.)

> Spark SQL field resolution error in GROUP BY HAVING clause
> --
>
> Key: SPARK-11770
> URL: https://issues.apache.org/jira/browse/SPARK-11770
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Simeon Simeonov
>  Labels: SQL
>
> A query fails to resolve columns from the source data when an alias is added 
> to the SELECT clause. I have not been able to isolate a reproducible 
> standalone test. Below are a series of {{spark-shell}} operations that show 
> the problem step-by-step. Spark SQL execution happens via {{HiveContext}}.
> I believe the root cause of the problem is that when (and only when) there 
> are aliased expression columns in the SELECT clause, Spark SQL "sees" columns 
> from the SELECT clause while evaluating a HAVING clause. According to the SQL 
> standard that should not happen.
> The table in question is simple:
> {code}
> scala> ctx.table("hevents_test").printSchema
> 15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:19:19 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> root
>  |-- vertical: string (nullable = true)
>  |-- did: string (nullable = true)
>  |-- surl: string (nullable = true)
>  |-- creative_id: long (nullable = true)
>  |-- keyword_text: string (nullable = true)
>  |-- errors: integer (nullable = true)
>  |-- views: integer (nullable = true)
>  |-- clicks: integer (nullable = true)
>  |-- actions: long (nullable = true)
> {code}
> A basic aggregation with a SELECT expression works without a problem:
> {code}
> scala> ctx.sql("""
>  |   select 1.0*creative_id, sum(views) as views
>  |   from hevents_test
>  |   group by creative_id
>  |   having sum(views) > 500
>  | """)
> 15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, 
> sum(views) as views
>   from hevents_test
>   group by creative_id
>   having sum(views) > 500
> 15/11/16 22:25:53 INFO ParseDriver: Parse Completed
> 15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:25:53 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
> {code}
> However, if the expression is aliased, the analyzer gets confused about 
> {{views}}.
> {code}
> scala> ctx.sql("""
>  | select 1.0*creative_id as cid, sum(views) as views
>  | from hevents_test
>  | group by creative_id
>  | having sum(views) > 500
>  | """)
> 15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id 
> as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:26:59 INFO ParseDriver: Parse Completed
> 15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=hevents_test
> 15/11/16 22:26:59 INFO audit: ugi=sim ip=unknown-ip-addr  cmd=get_table : 
> db=default tbl=hevents_test
> org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L 
> missing from 
> vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5
>  in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as 
> bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as 
> double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 

[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3

2015-11-24 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026082#comment-15026082
 ] 

Simeon Simeonov commented on SPARK-10574:
-

[~josephkb] +1 on (1)-(3) and sorry about the delay in responding.

I've been doing some hashing experiments on our data and the 32-bit Murmur3 is 
doing very well performance-wise so going with it initially makes sense.

I'm sorry, I don't quite understand your point about the setter API...

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11770) Spark SQL field resolution error in GROUP BY HAVING clause

2015-11-16 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-11770:
---

 Summary: Spark SQL field resolution error in GROUP BY HAVING clause
 Key: SPARK-11770
 URL: https://issues.apache.org/jira/browse/SPARK-11770
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Simeon Simeonov


A query fails to resolve columns from the source data when an alias is added to 
the SELECT clause. I have not been able to isolate a reproducible standalone 
test. Below are a series of {{spark-shell}} operations that show the problem 
step-by-step. Spark SQL execution happens via {{HiveContext}}.

I believe the root cause of the problem is that when (and only when) there are 
aliased expression columns in the SELECT clause, Spark SQL "sees" columns from 
the SELECT clause while evaluating a HAVING clause. According to the SQL 
standard that should not happen.
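
If that is indeed the cause, one workaround consistent with it, sketched here 
against the failing query shown below, is to keep the output alias from 
shadowing the source column:

{code}
// Same aggregation as below, but the alias no longer collides with the
// source column name, so HAVING should resolve sum(views) against the table.
ctx.sql("""
  select 1.0*creative_id as cid, sum(views) as total_views
  from hevents_test
  group by creative_id
  having sum(views) > 500
""")
{code}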

The table in question is simple:

{code}
scala> ctx.table("hevents_test").printSchema
15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
15/11/16 22:19:19 INFO audit: ugi=sim   ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=hevents_test
root
 |-- vertical: string (nullable = true)
 |-- did: string (nullable = true)
 |-- surl: string (nullable = true)
 |-- creative_id: long (nullable = true)
 |-- keyword_text: string (nullable = true)
 |-- errors: integer (nullable = true)
 |-- views: integer (nullable = true)
 |-- clicks: integer (nullable = true)
 |-- actions: long (nullable = true)
{code}

A basic aggregation with a SELECT expression works without a problem:

{code}
scala> ctx.sql("""
 |   select 1.0*creative_id, sum(views) as views
 |   from hevents_test
 |   group by creative_id
 |   having sum(views) > 500
 | """)
15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, 
sum(views) as views
  from hevents_test
  group by creative_id
  having sum(views) > 500
15/11/16 22:25:53 INFO ParseDriver: Parse Completed
15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
15/11/16 22:25:53 INFO audit: ugi=sim   ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=hevents_test
res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
{code}

However, if the expression is aliased, the analyzer gets confused about 
{{views}}.

{code}
scala> ctx.sql("""
 | select 1.0*creative_id as cid, sum(views) as views
 | from hevents_test
 | group by creative_id
 | having sum(views) > 500
 | """)
15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id as 
cid, sum(views) as views
from hevents_test
group by creative_id
having sum(views) > 500
15/11/16 22:26:59 INFO ParseDriver: Parse Completed
15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
15/11/16 22:26:59 INFO audit: ugi=sim   ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=hevents_test
org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L missing 
from 
vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5
 in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as 
bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as 
double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
at 

[jira] [Created] (SPARK-11522) input_file_name() returns "" for external tables

2015-11-04 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-11522:
---

 Summary: input_file_name() returns "" for external tables
 Key: SPARK-11522
 URL: https://issues.apache.org/jira/browse/SPARK-11522
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Simeon Simeonov


Given an external table definition where the data consists of many CSV files, 
{{input_file_name()}} returns empty strings.

Table definition:

{code}
CREATE EXTERNAL TABLE external_test(page_id INT, impressions INT) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar" = "\"",
   "escapeChar"= "\\"
)  
LOCATION 'file:///Users/sim/spark/test/external_test'
{code}

Query: 

{code}
sql("SELECT input_file_name() as file FROM external_test").show
{code}

Output:

{code}
++
|file|
++
||
||
...
||
++
{code}
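
For comparison, the paths are recoverable when the same directory is read 
through the RDD API. A sketch (reasonable only for small files, since 
{{wholeTextFiles}} loads each file fully):

{code}
// Sketch: wholeTextFiles pairs each file's path with its content,
// which can stand in for input_file_name() on small CSV files
val withPaths = sc.wholeTextFiles("file:///Users/sim/spark/test/external_test")
withPaths.keys.take(5).foreach(println)
{code}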






[jira] [Created] (SPARK-11523) spark_partition_id() considered invalid function

2015-11-04 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-11523:
---

 Summary: spark_partition_id() considered invalid function
 Key: SPARK-11523
 URL: https://issues.apache.org/jira/browse/SPARK-11523
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Simeon Simeonov


{{spark_partition_id()}} works correctly in top-level {{SELECT}} statements but 
is not recognized in {{SELECT}} statements that define views. It seems that DDL 
processing and query execution in Spark SQL use two different parsers and/or 
environments.
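
A possible workaround (a sketch, not verified here) is to bypass the Hive view 
DDL path entirely and register the query result as a temporary table:

{code}
// Sketch: a temporary table gives view-like access without Hive's DDL parser
ctx.sql("select spark_partition_id() as partition_id from test_data")
  .registerTempTable("test_view")
ctx.sql("select partition_id from test_view").show()
{code}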

In the following examples, instead of the {{test_data}} table you can use any 
defined table name.

A top-level statement works:

{code}
scala> ctx.sql("select spark_partition_id() as partition_id from 
test_data").show
++
|partition_id|
++
|   0|
...
|   0|
++
only showing top 20 rows
{code}

The same query in a view definition fails with {{Invalid function 
'spark_partition_id'}}.

{code}
scala> ctx.sql("create view test_view as select spark_partition_id() as 
partition_id from test_data")
15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
select spark_partition_id() as partition_id from test_data
15/11/05 01:05:38 INFO ParseDriver: Parse Completed
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
select spark_partition_id() as partition_id from test_data
15/11/05 01:05:38 INFO ParseDriver: Parse Completed
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO CalcitePlanner: Starting Semantic Analysis
15/11/05 01:05:38 INFO CalcitePlanner: Creating view default.test_view 
position=12
15/11/05 01:05:38 INFO HiveMetaStore: 0: get_database: default
15/11/05 01:05:38 INFO audit: ugi=sim   ip=unknown-ip-addr  
cmd=get_database: default
15/11/05 01:05:38 INFO CalcitePlanner: Completed phase 1 of Semantic Analysis
15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for source tables
15/11/05 01:05:38 INFO HiveMetaStore: 0: get_table : db=default tbl=test_data
15/11/05 01:05:38 INFO audit: ugi=sim   ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=test_data
15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for subqueries
15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for destination tables
15/11/05 01:05:38 INFO Context: New scratch dir is 
hdfs://localhost:9000/tmp/hive/sim/3fce9b7e-011f-4632-b673-e29067779fa0/hive_2015-11-05_01-05-38_518_4526721093949438849-1
15/11/05 01:05:38 INFO CalcitePlanner: Completed getting MetaData in Semantic 
Analysis
15/11/05 01:05:38 INFO BaseSemanticAnalyzer: Not invoking CBO because the 
statement doesn't have QUERY or EXPLAIN as root and not a CTAS; has create view
15/11/05 01:05:38 ERROR Driver: FAILED: SemanticException [Error 10011]: Line 
1:32 Invalid function 'spark_partition_id'
org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:32 Invalid function 
'spark_partition_id'
at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:925)
at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1265)
at 
org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:95)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:79)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:133)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:110)
at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:205)
at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:149)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:10512)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:10468)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3840)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3619)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:8956)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:8911)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9756)
at 

[jira] [Commented] (SPARK-10217) Spark SQL cannot handle ordering directive in ORDER BY clauses with expressions

2015-10-15 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959197#comment-14959197
 ] 

Simeon Simeonov commented on SPARK-10217:
-

Well, that would suggest the issue is fixed. :)

> Spark SQL cannot handle ordering directive in ORDER BY clauses with 
> expressions
> ---
>
> Key: SPARK-10217
> URL: https://issues.apache.org/jira/browse/SPARK-10217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: SQL, analyzers
>
> Spark SQL supports expressions in ORDER BY clauses, e.g.,
> {code}
> scala> sqlContext.sql("select cnt from test order by (cnt + cnt)")
> res2: org.apache.spark.sql.DataFrame = [cnt: bigint]
> {code}
> However, the analyzer gets confused when there is an explicit ordering 
> directive (ASC/DESC):
> {code}
> scala> sqlContext.sql("select cnt from test order by (cnt + cnt) asc")
> 15/08/25 04:08:02 INFO ParseDriver: Parsing command: select cnt from test 
> order by (cnt + cnt) asc
> org.apache.spark.sql.AnalysisException: extraneous input 'asc' expecting EOF 
> near ''; line 1 pos 40
>   at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> ...
> {code}






[jira] [Closed] (SPARK-9344) Spark SQL documentation does not clarify INSERT INTO behavior

2015-10-01 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov closed SPARK-9344.
--
Resolution: Cannot Reproduce

> Spark SQL documentation does not clarify INSERT INTO behavior
> -
>
> Key: SPARK-9344
> URL: https://issues.apache.org/jira/browse/SPARK-9344
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.1
>Reporter: Simeon Simeonov
>Priority: Minor
>  Labels: documentation, sql
>
> The Spark SQL documentation does not address {{INSERT INTO}} behavior. The 
> section on Hive compatibility is misleading as it claims support for "the 
> vast majority of Hive features". The user mailing list has conflicting 
> information, including posts that claim {{INSERT INTO}} support targeting 1.0.
> In 1.4.1, using Hive {{INSERT INTO}} syntax generates parse errors.






[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3

2015-10-01 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939876#comment-14939876
 ] 

Simeon Simeonov commented on SPARK-10574:
-

[~josephkb] any thoughts on the above?

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.






[jira] [Commented] (SPARK-9344) Spark SQL documentation does not clarify INSERT INTO behavior

2015-10-01 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939871#comment-14939871
 ] 

Simeon Simeonov commented on SPARK-9344:


[~joshrosen] you are absolutely right. I was rushing to create a test case and 
used {{save}} as opposed to {{saveAsTable}} by mistake. I don't have the code 
that generated the error originally: it was a complex set of Spark SQL 
statements. Either way, when I tried this with {{saveAsTable}} in 1.5.0 it 
worked so I'm closing the issue.
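
For reference, a sketch of the {{saveAsTable}} variant (reusing the toy data 
from the original repro; not re-run beyond what is described above):

{code}
import org.apache.spark.sql.hive.HiveContext

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

// saveAsTable registers the tables with the metastore, unlike save()
(1 to 5).map(Tuple1.apply).toDF("w_int").write.saveAsTable("test_data1")
(6 to 9).map(Tuple1.apply).toDF("w_int").write.saveAsTable("test_data2")

ctx.sql("insert into table test_data1 select * from test_data2")
{code}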

> Spark SQL documentation does not clarify INSERT INTO behavior
> -
>
> Key: SPARK-9344
> URL: https://issues.apache.org/jira/browse/SPARK-9344
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.1
>Reporter: Simeon Simeonov
>Priority: Minor
>  Labels: documentation, sql
>
> The Spark SQL documentation does not address {{INSERT INTO}} behavior. The 
> section on Hive compatibility is misleading as it claims support for "the 
> vast majority of Hive features". The user mailing list has conflicting 
> information, including posts that claim {{INSERT INTO}} support targeting 1.0.
> In 1.4.1, using Hive {{INSERT INTO}} syntax generates parse errors.






[jira] [Commented] (SPARK-9762) ALTER TABLE cannot find column

2015-10-01 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939883#comment-14939883
 ] 

Simeon Simeonov commented on SPARK-9762:


[~yhuai] any thoughts on this?

> ALTER TABLE cannot find column
> --
>
> Key: SPARK-9762
> URL: https://issues.apache.org/jira/browse/SPARK-9762
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>
> {{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} 
> lists. 
> In the case of a table generated with {{HiveContext.read.json()}}, the output 
> of {{DESCRIBE dimension_components}} is:
> {code}
> comp_config   
> struct
> comp_criteria string
> comp_data_model   string
> comp_dimensions   
> struct
> comp_disabled boolean
> comp_id   bigint
> comp_path string
> comp_placementData struct
> comp_slot_types   array
> {code}
> However, {{alter table dimension_components change comp_dimensions 
> comp_dimensions 
> struct;}}
>  fails with:
> {code}
> 15/08/08 23:13:07 ERROR exec.DDLTask: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Invalid column reference 
> comp_dimensions
>   at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3584)
>   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:312)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:345)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:326)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:155)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:326)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:316)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:473)
> ...
> {code}
> Meanwhile, {{SHOW COLUMNS in dimension_components}} lists two columns: 
> {{col}} (which does not exist in the table) and {{z}}, which was just added.
> This suggests that DDL operations in Spark SQL use table metadata 
> inconsistently.
> Full spark-sql output 
> [here|https://gist.github.com/ssimeonov/636a25d6074a03aafa67].






[jira] [Commented] (SPARK-9762) ALTER TABLE cannot find column

2015-10-01 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940757#comment-14940757
 ] 

Simeon Simeonov commented on SPARK-9762:


[~yhuai] Refreshing is not the issue here. The issue is that {{DESCRIBE tbl}} 
and {{SHOW COLUMNS tbl}} show different columns for a table even without 
altering it, which suggests that Spark SQL is not managing table metadata 
correctly.
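
A minimal side-by-side check (a sketch, using a {{HiveContext}} as in the other 
issues here; the table name comes from this report):

{code}
// Sketch: compare the two metadata views of the same table
ctx.sql("describe dimension_components").show(100)
ctx.sql("show columns in dimension_components").show(100)
{code}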

> ALTER TABLE cannot find column
> --
>
> Key: SPARK-9762
> URL: https://issues.apache.org/jira/browse/SPARK-9762
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>
> {{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} 
> lists. 
> In the case of a table generated with {{HiveContext.read.json()}}, the output 
> of {{DESCRIBE dimension_components}} is:
> {code}
> comp_config   
> struct
> comp_criteria string
> comp_data_model   string
> comp_dimensions   
> struct
> comp_disabled boolean
> comp_id   bigint
> comp_path string
> comp_placementData struct
> comp_slot_types   array
> {code}
> However, {{alter table dimension_components change comp_dimensions 
> comp_dimensions 
> struct;}}
>  fails with:
> {code}
> 15/08/08 23:13:07 ERROR exec.DDLTask: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Invalid column reference 
> comp_dimensions
>   at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3584)
>   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:312)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:345)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:326)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:155)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:326)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:316)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:473)
> ...
> {code}
> Meanwhile, {{SHOW COLUMNS in dimension_components}} lists two columns: 
> {{col}} (which does not exist in the table) and {{z}}, which was just added.
> This suggests that DDL operations in Spark SQL use table metadata 
> inconsistently.
> Full spark-sql output 
> [here|https://gist.github.com/ssimeonov/636a25d6074a03aafa67].






[jira] [Commented] (SPARK-9762) ALTER TABLE cannot find column

2015-10-01 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940776#comment-14940776
 ] 

Simeon Simeonov commented on SPARK-9762:


[~yhuai] the Hive compatibility section of the documentation should be updated 
to identify these cases. It is unfortunate to trust the docs only to discover a 
known lack of compatibility that was not documented.

> ALTER TABLE cannot find column
> --
>
> Key: SPARK-9762
> URL: https://issues.apache.org/jira/browse/SPARK-9762
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>
> {{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} 
> lists. 
> In the case of a table generated with {{HiveContext.read.json()}}, the output 
> of {{DESCRIBE dimension_components}} is:
> {code}
> comp_config   
> struct
> comp_criteria string
> comp_data_model   string
> comp_dimensions   
> struct
> comp_disabled boolean
> comp_id   bigint
> comp_path string
> comp_placementData struct
> comp_slot_types   array
> {code}
> However, {{alter table dimension_components change comp_dimensions 
> comp_dimensions 
> struct;}}
>  fails with:
> {code}
> 15/08/08 23:13:07 ERROR exec.DDLTask: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Invalid column reference 
> comp_dimensions
>   at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3584)
>   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:312)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:345)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:326)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:155)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:326)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:316)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:473)
> ...
> {code}
> Meanwhile, {{SHOW COLUMNS in dimension_components}} lists two columns: 
> {{col}} (which does not exist in the table) and {{z}}, which was just added.
> This suggests that DDL operations in Spark SQL use table metadata 
> inconsistently.
> Full spark-sql output 
> [here|https://gist.github.com/ssimeonov/636a25d6074a03aafa67].






[jira] [Commented] (SPARK-9761) Inconsistent metadata handling with ALTER TABLE

2015-10-01 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940778#comment-14940778
 ] 

Simeon Simeonov commented on SPARK-9761:


[~yhuai] What about this one? The problem survives a restart, so it doesn't 
seem to be caused by a lack of refreshing.

> Inconsistent metadata handling with ALTER TABLE
> ---
>
> Key: SPARK-9761
> URL: https://issues.apache.org/jira/browse/SPARK-9761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: hive, sql
>
> Schema changes made with {{ALTER TABLE}} are not shown in {{DESCRIBE TABLE}}. 
> The table in question was created with {{HiveContext.read.json()}}.
> Steps:
> # {{alter table dimension_components add columns (z string);}} succeeds.
> # {{describe dimension_components;}} does not show the new column, even after 
> restarting spark-sql.
> # A second {{alter table dimension_components add columns (z string);}} fails 
> with RROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Duplicate column name: z
> Full spark-sql output 
> [here|https://gist.github.com/ssimeonov/d9af4b8bb76b9d7befde].






[jira] [Commented] (SPARK-9344) Spark SQL documentation does not clarify INSERT INTO behavior

2015-09-30 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938814#comment-14938814
 ] 

Simeon Simeonov commented on SPARK-9344:


/cc [~joshrosen] [~andrewor14]

> Spark SQL documentation does not clarify INSERT INTO behavior
> -
>
> Key: SPARK-9344
> URL: https://issues.apache.org/jira/browse/SPARK-9344
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.1
>Reporter: Simeon Simeonov
>Priority: Minor
>  Labels: documentation, sql
>
> The Spark SQL documentation does not address {{INSERT INTO}} behavior. The 
> section on Hive compatibility is misleading as it claims support for "the 
> vast majority of Hive features". The user mailing list has conflicting 
> information, including posts that claim {{INSERT INTO}} support targeting 1.0.
> In 1.4.1, using Hive {{INSERT INTO}} syntax generates parse errors.






[jira] [Commented] (SPARK-9344) Spark SQL documentation does not clarify INSERT INTO behavior

2015-09-30 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938825#comment-14938825
 ] 

Simeon Simeonov commented on SPARK-9344:


[~joshrosen] when I logged the bug I was using {{HiveContext}}. 

Given how many Spark SQL bugs are logged here, e.g., issues with view support, 
it does make sense for the SQL docs to become more reality-based. :)

> Spark SQL documentation does not clarify INSERT INTO behavior
> -
>
> Key: SPARK-9344
> URL: https://issues.apache.org/jira/browse/SPARK-9344
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.1
>Reporter: Simeon Simeonov
>Priority: Minor
>  Labels: documentation, sql
>
> The Spark SQL documentation does not address {{INSERT INTO}} behavior. The 
> section on Hive compatibility is misleading as it claims support for "the 
> vast majority of Hive features". The user mailing list has conflicting 
> information, including posts that claim {{INSERT INTO}} support targeting 1.0.
> In 1.4.1, using Hive {{INSERT INTO}} syntax generates parse errors.






[jira] [Commented] (SPARK-9344) Spark SQL documentation does not clarify INSERT INTO behavior

2015-09-30 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938806#comment-14938806
 ] 

Simeon Simeonov commented on SPARK-9344:


Are you suggesting to fix the docs or the code?

> Spark SQL documentation does not clarify INSERT INTO behavior
> -
>
> Key: SPARK-9344
> URL: https://issues.apache.org/jira/browse/SPARK-9344
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 1.4.1
>Reporter: Simeon Simeonov
>Priority: Minor
>  Labels: documentation, sql
>
> The Spark SQL documentation does not address {{INSERT INTO}} behavior. The 
> section on Hive compatibility is misleading as it claims support for "the 
> vast majority of Hive features". The user mailing list has conflicting 
> information, including posts that claim {{INSERT INTO}} support targeting 1.0.
> In 1.4.1, using Hive {{INSERT INTO}} syntax generates parse errors.






[jira] [Commented] (SPARK-9344) Spark SQL documentation does not clarify INSERT INTO behavior

2015-09-30 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938918#comment-14938918
 ] 

Simeon Simeonov commented on SPARK-9344:


[~joshrosen] Here is the reproducible test case you can try in {{spark-shell}}:

{code}
import org.apache.spark.sql.hive.HiveContext

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

(1 to 5).map(Tuple1.apply).toDF("w_int").write.save("test_data1")
(6 to 9).map(Tuple1.apply).toDF("w_int").write.save("test_data2")

ctx.sql("insert into table test_data1 select * from test_data2")
{code}

This fails with:

{code}
scala> ctx.sql("insert into table test_data1 select * from test_data2")
15/09/30 17:32:34 INFO ParseDriver: Parsing command: insert into table 
test_data1 select * from test_data2
15/09/30 17:32:34 INFO ParseDriver: Parse Completed
15/09/30 17:32:34 INFO HiveMetaStore: 0: get_table : db=default tbl=test_data1
15/09/30 17:32:34 INFO audit: ugi=sim   ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=test_data1
org.apache.spark.sql.AnalysisException: no such table test_data1; line 1 pos 18
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:225)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:231)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:229)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:212)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:229)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:219)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:61)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:59)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:51)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:933)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:933)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:39)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:46)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:48)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:50)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:52)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:54)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:56)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:60)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:62)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:64)
at $iwC$$iwC$$iwC$$iwC.(:66)
at $iwC$$iwC$$iwC.(:68)
at $iwC$$iwC.(:70)
at $iwC.(:72)
at (:74)
at .(:78)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 

[jira] [Created] (SPARK-10724) SQL's floor() returns DOUBLE

2015-09-20 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-10724:
---

 Summary: SQL's floor() returns DOUBLE
 Key: SPARK-10724
 URL: https://issues.apache.org/jira/browse/SPARK-10724
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Simeon Simeonov
Priority: Critical


This is a change in behavior from 1.4.1 where {{floor}} returns a BIGINT. 

{code}
scala> sql("select floor(1)").printSchema
root
 |-- _c0: double (nullable = true)
{code}

In the [Hive Language 
Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF] 
{{floor}} is defined to return BIGINT.

This is a significant issue because it changes the DataFrame schema.

I wonder what caused this and whether other SQL functions are affected.
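
Until this is resolved, an explicit cast is a possible workaround (a sketch, not 
verified against 1.5.0):

{code}
// Sketch: force the BIGINT type that 1.4.1 produced
sql("select cast(floor(1) as bigint) as c").printSchema
{code}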






[jira] [Created] (SPARK-10722) Uncaught exception: RDDBlockId not found in driver-heartbeater

2015-09-20 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-10722:
---

 Summary: Uncaught exception: RDDBlockId not found in 
driver-heartbeater
 Key: SPARK-10722
 URL: https://issues.apache.org/jira/browse/SPARK-10722
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0, 1.4.1, 1.3.1
Reporter: Simeon Simeonov


Some operations involving cached RDDs generate an uncaught exception in 
driver-heartbeater. If the {{.cache()}} call is removed, processing happens 
without the exception. However, not all RDDs trigger the problem, i.e., some 
{{.cache()}} operations are fine. 

I can see the problem with 1.4.1 and 1.5.0 but I have not been able to create a 
reproducible test case. The same exception is [reported on 
SO|http://stackoverflow.com/questions/31280355/spark-test-on-local-machine] for 
v1.3.1 but the behavior is related to large broadcast variables.

The full stack trace is:

{code}
15/09/20 22:10:08 ERROR Utils: Uncaught exception in thread driver-heartbeater
java.io.IOException: java.lang.ClassNotFoundException: 
org.apache.spark.storage.RDDBlockId
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
  at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at org.apache.spark.util.Utils$.deserialize(Utils.scala:91)
  at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:440)
  at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:430)
  at scala.Option.foreach(Option.scala:236)
  at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:430)
  at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:428)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:428)
  at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:472)
  at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
  at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
  at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:472)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
  at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
  at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:270)
  at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:625)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at 

[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3

2015-09-14 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743922#comment-14743922
 ] 

Simeon Simeonov commented on SPARK-10574:
-

[~josephkb] this makes sense. There are a few decisions to make:

# Which data types should we support out-of-the-box hashing for? HashingTF is 
typically used on strings but the implementation is currently not restricted to 
that.

# If we choose to support them, how will the hashing happen for non-String 
types? For example, does a Double get converted to binary first (as the 
toString() representation may not be perfectly accurate) or do we call this a 
feature (a tiny bit of LSH to make reasoning more robust)? I believe our goal 
here should be to have a rock-solid, deterministic definition that works the 
same everywhere.

# How do we safely open the door for user-provided hashing functions? This 
could come at no extra cost through a simple trait and the parameter specifying 
the hashing function being a class name that must implement that trait, with 
Spark providing a murmur and a "native" implementation. I tend to prefer this 
simple pattern to hard-coded decision logic instantiating a finite set of 
classes. This would allow Spark users to experiment with xxHash, CityHash, 
minwise hashing and, if they so choose, even forms of locality-sensitive 
hashing. New hashes could be contributed to the project or become discoverable 
on spark-packages. It would also allow for analysis patterns down the road, 
e.g., a decorator class that analyzes the distribution of collisions as a side 
effect to help practitioners choose the right hashing function. 

I'd appreciate your thoughts.
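
To make #3 concrete, here is a minimal sketch of what such a trait could look 
like (all names are hypothetical):

{code}
import scala.util.hashing.MurmurHash3

// Hypothetical extension point; implementations must be deterministic across platforms
trait FeatureHasher extends Serializable {
  def hash(term: Any): Int
}

class MurmurFeatureHasher extends FeatureHasher {
  override def hash(term: Any): Int = MurmurHash3.stringHash(term.toString)
}

class NativeFeatureHasher extends FeatureHasher {
  override def hash(term: Any): Int = term.##
}

// HashingTF could then instantiate the hasher from a class-name parameter
def hasherFor(className: String): FeatureHasher =
  Class.forName(className).newInstance().asInstanceOf[FeatureHasher]
{code}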


> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.






[jira] [Comment Edited] (SPARK-10574) HashingTF should use MurmurHash3

2015-09-14 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743922#comment-14743922
 ] 

Simeon Simeonov edited comment on SPARK-10574 at 9/14/15 5:55 PM:
--

[~josephkb] this makes sense. There are a few decisions to make:

# Which data types should we support out-of-the-box hashing for? HashingTF is 
typically used on strings but the implementation is currently not restricted to 
that.
# If we choose to support them, how will the hashing happen for non-String 
types? For example, does a Double get converted to binary first (as the 
toString() representation may not be perfectly accurate) or do we call this a 
feature (a tiny bit of LSH to make reasoning more robust)? I believe our goal 
here should be to have a rock-solid, deterministic definition that works the 
same everywhere.
# How do we safely open the door for user-provided hashing functions? This 
could come at no extra cost through a simple trait and the parameter specifying 
the hashing function being a class name that must implement that trait, with 
Spark providing a murmur and a "native" implementation. I tend to prefer this 
simple pattern to hard-coded decision logic instantiating a finite set of 
classes. This would allow Spark users to experiment with xxHash, CityHash, 
minwise hashing and, if they so choose, even forms of locality-sensitive 
hashing. New hashes could be contributed to the project or become discoverable 
on spark-packages. It would also allow for analysis patterns down the road, 
e.g., a decorator class that analyzes the distribution of collisions as a side 
effect to help practitioners choose the right hashing function. 

I'd appreciate your thoughts.



was (Author: simeons):
[~josephkb] this makes sense. There are a few decisions to make:

# Which data types should we support out-of-the-box hashing for? HashingTF is 
typically used on strings but the implementation is currently not restricted to 
that.

# If we choose to support them, how will the hashing happen for non-String 
types? For example, does a Double get converted to binary first (as the 
toString() representation may not be perfectly accurate) or do we call this a 
feature (a tiny bit of LSH to make reasoning more robust)? I believe our goal 
here should be to have a rock-solid, deterministic definition that works the 
same everywhere.

# How do we safely open the door for user-provided hashing functions? This 
could come at no extra cost through a simple trait and the parameter specifying 
the hashing function being a class name that must implement that trait, with 
Spark providing a murmur and a "native" implementation. I tend to prefer this 
simple pattern to hard-coded decision logic instantiating a finite set of 
classes. This would allow Spark users to experiment with xxHash, CityHash, 
minwise hashing and, if they so choose, even forms of locality-sensitive 
hashing. New hashes could be contributed to the project or become discoverable 
on spark-packages. It would also allow for analysis patterns down the road, 
e.g., a decorator class that analyzes the distribution of collisions as a side 
effect to help practitioners choose the right hashing function. 

I'd appreciate your thoughts.


> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like 

[jira] [Commented] (SPARK-8345) Add an SQL node as a feature transformer

2015-09-11 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741618#comment-14741618
 ] 

Simeon Simeonov commented on SPARK-8345:


This would be very nice, especially as it can leverage existing UDFs.

> Add an SQL node as a feature transformer
> 
>
> Key: SPARK-8345
> URL: https://issues.apache.org/jira/browse/SPARK-8345
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 1.6.0
>
>
> Some simple feature transformations can take leverage on SQL operators. Users 
> do not need to create an ML transformer for each of them. We can have an SQL 
> transformer that executes an SQL command which operates on the input 
> dataframe.
> {code}
> val sql = new SQL()
>   .setStatement("SELECT *, length(text) AS text_length FROM __THIS__")
> {code}
> where "__THIS__" will be replaced by a temp table that represents the 
> DataFrame.






[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3

2015-09-11 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-10574:

Description: 
{{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
two significant problems with this.

First, per the [Scala 
documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
{{hashCode}}, the implementation is platform specific. This means that feature 
vectors created on one platform may be different than vectors created on 
another platform. This can create significant problems when a model trained 
offline is used in another environment for online prediction. The problem is 
made harder by the fact that following a hashing transform features lose 
human-tractable meaning and a problem such as this may be extremely difficult 
to track down.

Second, the native Scala hashing function performs badly on longer strings, 
exhibiting [200-500% higher collision 
rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
example, 
[MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
 which is also included in the standard Scala libraries and is the hashing 
choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
Spark users apply {{HashingTF}} only to very short, dictionary-like strings the 
hashing function choice will not be a big problem but why have an 
implementation in MLlib with this limitation when there is a better 
implementation readily available in the standard Scala library?

Switching to MurmurHash3 solves both problems. If there is agreement that this 
is a good change, I can prepare a PR. 

Note that changing the hash function would mean that models saved with a 
previous version would have to be re-trained. This introduces a problem that's 
orthogonal to breaking changes in APIs: breaking changes related to artifacts, 
e.g., a saved model, produced by a previous version. Is there a policy or best 
practice currently in effect about this? If not, perhaps we should come up with 
a few simple rules about how we communicate these in release notes, etc.

  was:
{{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
two significant problems with this.

First, per the [Scala 
documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
{{hashCode}}, the implementation is platform specific. This means that feature 
vectors created on one platform may be different than vectors created on 
another platform. This can create significant problems when a model trained 
offline is used in another environment for online prediction. The problem is 
made harder by the fact that following a hashing transform features lose 
human-tractable meaning and a problem such as this may be extremely difficult 
to track down.

Second, the native Scala hashing function performs badly on longer strings, 
exhibiting [200-500% higher collision 
rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
example, 
[MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
 which is also included in the standard Scala libraries and is the hashing 
choice of fast learners such as Vowpal Wabbit and others. If Spark users apply 
{{HashingTF}} only to very short, dictionary-like strings the hashing function 
choice will not be a big problem but why have an implementation in MLlib with 
this limitation when there is a better implementation readily available in the 
standard Scala library?

Switching to MurmurHash3 solves both problems. If there is agreement that this 
is a good change, I can prepare a PR. 

Note that changing the hash function would mean that models saved with a 
previous version would have to be re-trained. This introduces a problem that's 
orthogonal to breaking changes in APIs: breaking changes related to artifacts, 
e.g., a saved model, produced by a previous version. Is there a policy or best 
practice currently in effect about this? If not, perhaps we should come up with 
a few simple rules about how we communicate these in release notes, etc.


> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different 

[jira] [Created] (SPARK-10574) HashingTF should use MurmurHash3

2015-09-11 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-10574:
---

 Summary: HashingTF should use MurmurHash3
 Key: SPARK-10574
 URL: https://issues.apache.org/jira/browse/SPARK-10574
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Simeon Simeonov
Priority: Critical


{{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
two significant problems with this.

First, per the [Scala 
documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
{{hashCode}}, the implementation is platform specific. This means that feature 
vectors created on one platform may be different than vectors created on 
another platform. This can create significant problems when a model trained 
offline is used in another environment for online prediction. The problem is 
made harder by the fact that following a hashing transform features lose 
human-tractable meaning and a problem such as this may be extremely difficult 
to track down.

Second, the native Scala hashing function performs badly on longer strings, 
exhibiting [200-500% higher collision 
rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
example, 
[MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
 which is also included in the standard Scala libraries and is the hashing 
choice of fast learners such as Vowpal Wabbit and others. If Spark users apply 
{{HashingTF}} only to very short, dictionary-like strings, the hashing function 
choice will not be a big problem, but why have an implementation in MLlib with 
this limitation when there is a better implementation readily available in the 
standard Scala library?

Switching to MurmurHash3 solves both problems. If there is agreement that this 
is a good change, I can prepare a PR. 
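
For illustration only, here is a minimal sketch (not the MLlib implementation) of 
how the term-to-index mapping could be computed with MurmurHash3 instead of the 
native {{##}}; the {{numFeatures}} parameter and the non-negative modulo mirror 
what {{HashingTF}} does today, but the code below is written from scratch as an 
assumption of how the change could look:

{code}
import scala.util.hashing.MurmurHash3

// Sketch: map a term to a column index in [0, numFeatures)
// using MurmurHash3, which is stable across JVMs and platforms,
// instead of the platform-dependent Any.##.
def murmurIndexOf(term: String, numFeatures: Int, seed: Int = 42): Int = {
  val h = MurmurHash3.stringHash(term, seed)
  val mod = h % numFeatures
  if (mod < 0) mod + numFeatures else mod // keep the index non-negative
}

// The same term always lands in the same bucket, regardless of platform.
murmurIndexOf("spark", 1 << 20)
{code}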

Note that changing the hash function would mean that models saved with a 
previous version would have to be re-trained. This introduces a problem that's 
orthogonal to breaking changes in APIs: breaking changes related to artifacts, 
e.g., a saved model, produced by a previous version. Is there a policy or best 
practice currently in effect about this? If not, perhaps we should come up with 
a few simple rules about how we communicate these in release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10476) Add common RDD API methods to standard Scala collections

2015-09-07 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-10476:
---

 Summary: Add common RDD API methods to standard Scala collections
 Key: SPARK-10476
 URL: https://issues.apache.org/jira/browse/SPARK-10476
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Simeon Simeonov
Priority: Minor


A common pattern in Spark development is to look for opportunities to leverage 
data locality using mechanisms such as {{mapPartitions}}. Often this happens 
when an existing set of RDD transformations is refactored to improve 
performance. At that point, significant code refactoring may be required 
because the input is {{Iterator\[T]}} as opposed to an RDD. The most common 
examples we've encountered so far involve the {{*ByKey}} methods, {{sample}} 
and {{takeSample}}. We have also observed cases where, due to changes in the 
structure of data, use of {{mapPartitions}} is no longer possible and the code 
has to be converted to use the RDD API.

If data manipulation through the RDD API could be applied to the standard Scala 
data structures then refactoring Spark data pipelines would become faster and 
less bug-prone. Also, and this is no small benefit, the thoughtfulness and 
experience of the Spark community could spread to the broader Scala community.

There are multiple approaches to solving this problem, including but not 
limited to creating a set of {{Local*RDD}} classes and/or adding implicit 
conversions.

Here is a simple example meant to be short as opposed to complete or 
performance-optimized:

{code}
implicit class LocalRDD[T](it: Iterator[T]) extends Iterable[T] {
  def this(collection: Iterable[T]) = this(collection.toIterator)
  def iterator = it
}

implicit class LocalPairRDD[K, V](it: Iterator[(K, V)]) extends Iterable[(K, 
V)] {
  def this(collection: Iterable[(K, V)]) = this(collection.toIterator)
  def iterator = it
  def groupByKey() = new LocalPairRDD[K, Iterable[V]](
groupBy(_._1).map { case (k, valuePairs) => (k, valuePairs.map(_._2)) }
  )
}

sc.
  parallelize(Array((1, 10), (2, 10), (1, 20))).
  repartition(1).
  mapPartitions(data => data.groupByKey().toIterator).
  take(2)
// Array[(Int, Iterable[Int])] = Array((2,List(10)), (1,List(10, 20)))
{code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10476) Add common RDD operations on standard Scala collections

2015-09-07 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-10476:

Summary: Add common RDD operations on standard Scala collections  (was: Add 
common RDD API methods to standard Scala collections)

> Add common RDD operations on standard Scala collections
> ---
>
> Key: SPARK-10476
> URL: https://issues.apache.org/jira/browse/SPARK-10476
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Simeon Simeonov
>Priority: Minor
>  Labels: core, mapPartitions, rdd
>
> A common pattern in Spark development is to look for opportunities to 
> leverage data locality using mechanisms such as {{mapPartitions}}. Often this 
> happens when an existing set of RDD transformations is refactored to improve 
> performance. At that point, significant code refactoring may be required 
> because the input is {{Iterator\[T]}} as opposed to an RDD. The most common 
> examples we've encountered so far involve the {{*ByKey}} methods, {{sample}} 
> and {{takeSample}}. We have also observed cases where, due to changes in the 
> structure of data, use of {{mapPartitions}} is no longer possible and the code 
> has to be converted to use the RDD API.
> If data manipulation through the RDD API could be applied to the standard 
> Scala data structures then refactoring Spark data pipelines would become 
> faster and less bug-prone. Also, and this is no small benefit, the 
> thoughtfulness and experience of the Spark community could spread to the 
> broader Scala community.
> There are multiple approaches to solving this problem, including but not 
> limited to creating a set of {{Local*RDD}} classes and/or adding implicit 
> conversions.
> Here is a simple example meant to be short as opposed to complete or 
> performance-optimized:
> {code}
> implicit class LocalRDD[T](it: Iterator[T]) extends Iterable[T] {
>   def this(collection: Iterable[T]) = this(collection.toIterator)
>   def iterator = it
> }
> implicit class LocalPairRDD[K, V](it: Iterator[(K, V)]) extends Iterable[(K, 
> V)] {
>   def this(collection: Iterable[(K, V)]) = this(collection.toIterator)
>   def iterator = it
>   def groupByKey() = new LocalPairRDD[K, Iterable[V]](
> groupBy(_._1).map { case (k, valuePairs) => (k, valuePairs.map(_._2)) }
>   )
> }
> sc.
>   parallelize(Array((1, 10), (2, 10), (1, 20))).
>   repartition(1).
>   mapPartitions(data => data.groupByKey().toIterator).
>   take(2)
> // Array[(Int, Iterable[Int])] = Array((2,List(10)), (1,List(10, 20)))
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10476) Enable common RDD operations on standard Scala collections

2015-09-07 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-10476:

Summary: Enable common RDD operations on standard Scala collections  (was: 
Add common RDD operations on standard Scala collections)

> Enable common RDD operations on standard Scala collections
> --
>
> Key: SPARK-10476
> URL: https://issues.apache.org/jira/browse/SPARK-10476
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Simeon Simeonov
>Priority: Minor
>  Labels: core, mapPartitions, rdd
>
> A common pattern in Spark development is to look for opportunities to 
> leverage data locality using mechanisms such as {{mapPartitions}}. Often this 
> happens when an existing set of RDD transformations is refactored to improve 
> performance. At that point, significant code refactoring may be required 
> because the input is {{Iterator\[T]}} as opposed to an RDD. The most common 
> examples we've encountered so far involve the {{*ByKey}} methods, {{sample}} 
> and {{takeSample}}. We have also observed cases where, due to changes in the 
> structure of data, use of {{mapPartitions}} is no longer possible and the code 
> has to be converted to use the RDD API.
> If data manipulation through the RDD API could be applied to the standard 
> Scala data structures then refactoring Spark data pipelines would become 
> faster and less bug-prone. Also, and this is no small benefit, the 
> thoughtfulness and experience of the Spark community could spread to the 
> broader Scala community.
> There are multiple approaches to solving this problem, including but not 
> limited to creating a set of {{Local*RDD}} classes and/or adding implicit 
> conversions.
> Here is a simple example meant to be short as opposed to complete or 
> performance-optimized:
> {code}
> implicit class LocalRDD[T](it: Iterator[T]) extends Iterable[T] {
>   def this(collection: Iterable[T]) = this(collection.toIterator)
>   def iterator = it
> }
> implicit class LocalPairRDD[K, V](it: Iterator[(K, V)]) extends Iterable[(K, 
> V)] {
>   def this(collection: Iterable[(K, V)]) = this(collection.toIterator)
>   def iterator = it
>   def groupByKey() = new LocalPairRDD[K, Iterable[V]](
> groupBy(_._1).map { case (k, valuePairs) => (k, valuePairs.map(_._2)) }
>   )
> }
> sc.
>   parallelize(Array((1, 10), (2, 10), (1, 20))).
>   repartition(1).
>   mapPartitions(data => data.groupByKey().toIterator).
>   take(2)
> // Array[(Int, Iterable[Int])] = Array((2,List(10)), (1,List(10, 20)))
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10186) Add support for more postgres column types

2015-09-07 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-10186:

Description: 
The specific observations below are based on Postgres 9.4 tables accessed via 
the postgresql-9.4-1201.jdbc41.jar driver. However, based on the behavior, I 
would expect the problem to exist for all external SQL databases.

- *json and jsonb columns generate {{java.sql.SQLException: Unsupported type 
}}*. While it is reasonable to not support dynamic schema discovery of JSON 
columns automatically (it requires two passes over the data), a better behavior 
would be to create a String column and return the JSON.
- *Array columns generate {{java.sql.SQLException: Unsupported type 2003}}*. 
This is true even for simple types, e.g., {{text[]}}. A better behavior would 
be to create an Array column. 
- *Custom type columns are mapped to a String column.* This behavior is harder 
to understand as the schema of a custom type is fixed and therefore mappable to 
a Struct column. The automatic conversion to a string is also inconsistent when 
compared to json and array column handling.

The exceptions are thrown by 
{{org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)}}
 so this definitely looks like a Spark SQL and not a JDBC problem.

  was:
The specific observations below are based on Postgres 9.4 tables accessed via 
the postgresql-9.4-1201.jdbc41.jar driver. However, based on the behavior, I 
would expect the problem to exist for all external SQL databases.

- *json and jsonb columns generate {{java.sql.SQLException: Unsupported type 
}}*. While it is reasonable to not support dynamic schema discovery of JSON 
columns automatically (it requires two passes over the data), a better behavior 
would be to create a String column and return the JSON.
- *Array columns generate {{java.sql.SQLException: Unsupported type 2003}}*. 
This is true even for simple types, e.g., {{text[]}}. A better behavior would 
be to create an Array column. 
- *Custom type columns are mapped to a String column.* This behavior is harder 
to understand as the schema of a custom type is fixed and therefore mappable to 
a Struct column. The automatic conversion to a string is also inconsistent when 
compared to json and array column handling.

The exceptions are throw by 
{{org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)}}
 so this definitely looks like a Spark SQL and not a JDBC problem.


> Add support for more postgres column types
> --
>
> Key: SPARK-10186
> URL: https://issues.apache.org/jira/browse/SPARK-10186
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: array, json, postgres, sql, struct
>
> The specific observations below are based on Postgres 9.4 tables accessed via 
> the postgresql-9.4-1201.jdbc41.jar driver. However, based on the behavior, I 
> would expect the problem to exist for all external SQL databases.
> - *json and jsonb columns generate {{java.sql.SQLException: Unsupported type 
> }}*. While it is reasonable to not support dynamic schema discovery of 
> JSON columns automatically (it requires two passes over the data), a better 
> behavior would be to create a String column and return the JSON.
> - *Array columns generate {{java.sql.SQLException: Unsupported type 2003}}*. 
> This is true even for simple types, e.g., {{text[]}}. A better behavior would 
> be to create an Array column. 
> - *Custom type columns are mapped to a String column.* This behavior is 
> harder to understand as the schema of a custom type is fixed and therefore 
> mappable to a Struct column. The automatic conversion to a string is also 
> inconsistent when compared to json and array column handling.
> The exceptions are thrown by 
> {{org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)}}
>  so this definitely looks like a Spark SQL and not a JDBC problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10476) Enable common RDD operations on standard Scala collections

2015-09-07 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14733988#comment-14733988
 ] 

Simeon Simeonov commented on SPARK-10476:
-

No need for the abstraction to leak as all {{Local*RDD}} classes will deal with 
a single partition. The goal is not processing model compatibility. It's API 
compatibility. That aside, I agree about this being a separate library. It 
would help if some of the core logic, e.g., sampling, was easier to reuse but 
right now it seems quite tightly bound to the RDD implementation.
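
To illustrate the kind of reuse meant here, a local Bernoulli sample over a single 
partition's iterator only takes a few lines; this is a hypothetical sketch, not 
code lifted from Spark's samplers:

{code}
import scala.util.Random

// Hypothetical local analogue of RDD.sample(withReplacement = false, fraction):
// Bernoulli sampling over one partition's iterator.
def sampleLocally[T](it: Iterator[T], fraction: Double, seed: Long = 42L): Iterator[T] = {
  require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
  val rng = new Random(seed)
  it.filter(_ => rng.nextDouble() < fraction)
}

// Usage inside mapPartitions, keeping the work local to each partition:
// rdd.mapPartitions(part => sampleLocally(part, 0.1))
{code}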

> Enable common RDD operations on standard Scala collections
> --
>
> Key: SPARK-10476
> URL: https://issues.apache.org/jira/browse/SPARK-10476
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Simeon Simeonov
>Priority: Minor
>  Labels: core, mapPartitions, rdd
>
> A common pattern in Spark development is to look for opportunities to 
> leverage data locality using mechanisms such as {{mapPartitions}}. Often this 
> happens when an existing set of RDD transformations is refactored to improve 
> performance. At that point, significant code refactoring may be required 
> because the input is {{Iterator\[T]}} as opposed to an RDD. The most common 
> examples we've encountered so far involve the {{*ByKey}} methods, {{sample}} 
> and {{takeSample}}. We have also observed cases where, due to changes in the 
> structure of data, use of {{mapPartitions}} is no longer possible and the code 
> has to be converted to use the RDD API.
> If data manipulation through the RDD API could be applied to the standard 
> Scala data structures then refactoring Spark data pipelines would become 
> faster and less bug-prone. Also, and this is no small benefit, the 
> thoughtfulness and experience of the Spark community could spread to the 
> broader Scala community.
> There are multiple approaches to solving this problem, including but not 
> limited to creating a set of {{Local*RDD}} classes and/or adding implicit 
> conversions.
> Here is a simple example meant to be short as opposed to complete or 
> performance-optimized:
> {code}
> implicit class LocalRDD[T](it: Iterator[T]) extends Iterable[T] {
>   def this(collection: Iterable[T]) = this(collection.toIterator)
>   def iterator = it
> }
> implicit class LocalPairRDD[K, V](it: Iterator[(K, V)]) extends Iterable[(K, 
> V)] {
>   def this(collection: Iterable[(K, V)]) = this(collection.toIterator)
>   def iterator = it
>   def groupByKey() = new LocalPairRDD[K, Iterable[V]](
> groupBy(_._1).map { case (k, valuePairs) => (k, valuePairs.map(_._2)) }
>   )
> }
> sc.
>   parallelize(Array((1, 10), (2, 10), (1, 20))).
>   repartition(1).
>   mapPartitions(data => data.groupByKey().toIterator).
>   take(2)
> // Array[(Int, Iterable[Int])] = Array((2,List(10)), (1,List(10, 20)))
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7869) Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns

2015-08-24 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14709646#comment-14709646
 ] 

Simeon Simeonov commented on SPARK-7869:


This is also a problem for {{json}} columns, not just {{jsonb}} ones. 

It would be nice to get the JSON as a String column, instead of an error.
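
On 1.4+, where the {{JdbcDialect}} API is available, something close to that 
behavior can be approximated by registering a dialect that maps the Postgres 
json/jsonb types to {{StringType}}. The following is only a sketch of the idea, 
not a tested patch:

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

// Sketch: surface Postgres json/jsonb columns as plain strings instead of failing.
object PostgresJsonAsStringDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.OTHER && (typeName == "json" || typeName == "jsonb")) Some(StringType)
    else None // fall back to the default mapping for everything else
  }
}

JdbcDialects.registerDialect(PostgresJsonAsStringDialect)
{code}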


 Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns
 --

 Key: SPARK-7869
 URL: https://issues.apache.org/jira/browse/SPARK-7869
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.0, 1.3.1
 Environment: Spark 1.3.1
Reporter: Brad Willard
Priority: Minor

 Most of our tables load into dataframes just fine with postgres. However we 
 have a number of tables leveraging the JSONB datatype. Spark will error and 
 refuse to load this table. While asking for Spark to support JSONB might be a 
 tall order in the short term, it would be great if Spark would at least load 
 the table ignoring the columns it can't load or have it be an option.
 pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json")
 Py4JJavaError: An error occurred while calling o41.load.
 : java.sql.SQLException: Unsupported type 
 at org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType(JDBCRDD.scala:78)
 at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:112)
 at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:133)
 at 
 org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
 at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
 at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
 at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
 at py4j.Gateway.invoke(Gateway.java:259)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:207)
 at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10186) Inconsistent handling of complex column types in external databases

2015-08-24 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-10186:
---

 Summary: Inconsistent handling of complex column types in external 
databases
 Key: SPARK-10186
 URL: https://issues.apache.org/jira/browse/SPARK-10186
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov


The specific observations below are based on Postgres 9.4 tables accessed via 
the postgresql-9.4-1201.jdbc41.jar driver. However, based on the behavior, I 
would expect the problem to exist for all external SQL databases.

- *json and jsonb columns generate {{java.sql.SQLException: Unsupported type 
}}*. While it is reasonable to not support dynamic schema discovery of JSON 
columns automatically (it requires two passes over the data), a better behavior 
would be to create a String column and return the JSON.
- *Array columns generate {{java.sql.SQLException: Unsupported type 2003}}*. 
This is true even for simple types, e.g., {{text[]}}. A better behavior would 
be to create an Array column. 
- *Custom type columns are mapped to a String column.* This behavior is harder 
to understand as the schema of a custom type is fixed and therefore mappable to 
a Struct column. The automatic conversion to a string is also inconsistent when 
compared to json and array column handling.

The exceptions are thrown by 
{{org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)}}
 so this definitely looks like a Spark SQL and not a JDBC problem.
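
Until the mappings improve, one workaround that avoids the exception is to push a 
cast into the JDBC query itself, since the {{dbtable}} option accepts a 
parenthesized subquery. A hedged example (the table and column names below are 
made up):

{code}
// Hypothetical table "events" with a jsonb column "payload" and a text[] column "tags".
// Casting on the Postgres side sidesteps the unsupported JDBC type codes.
val df = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:postgresql://localhost/mydb",
  "driver" -> "org.postgresql.Driver",
  "dbtable" -> "(select id, payload::text as payload, array_to_string(tags, ',') as tags from events) as t"
)).load()
{code}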



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10217) Spark SQL cannot handle ordering directive in ORDER BY clauses with expressions

2015-08-24 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-10217:
---

 Summary: Spark SQL cannot handle ordering directive in ORDER BY 
clauses with expressions
 Key: SPARK-10217
 URL: https://issues.apache.org/jira/browse/SPARK-10217
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov


Spark SQL supports expressions in ORDER BY clauses, e.g.,

{code}
scala> sqlContext.sql("select cnt from test order by (cnt + cnt)")
res2: org.apache.spark.sql.DataFrame = [cnt: bigint]
{code}

However, the analyzer gets confused when there is an explicit ordering 
directive (ASC/DESC):

{code}
scala> sqlContext.sql("select * from cats order by (cnt + cnt) asc")
15/08/25 04:08:02 INFO ParseDriver: Parsing command: select cnt from test order 
by (cnt + cnt) asc
org.apache.spark.sql.AnalysisException: extraneous input 'asc' expecting EOF 
near 'EOF'; line 1 pos 40
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
...
{code}
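
As a workaround until the parser accepts this, the expression can be aliased in a 
subquery and the alias used in the ORDER BY, which parses fine; a minimal sketch 
assuming the same {{test}} table:

{code}
sqlContext.sql(
  "select cnt from (select cnt, cnt + cnt as sort_key from test) t order by sort_key asc")
{code}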



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10217) Spark SQL cannot handle ordering directive in ORDER BY clauses with expressions

2015-08-24 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-10217:

Description: 
Spark SQL supports expressions in ORDER BY clauses, e.g.,

{code}
scala> sqlContext.sql("select cnt from test order by (cnt + cnt)")
res2: org.apache.spark.sql.DataFrame = [cnt: bigint]
{code}

However, the analyzer gets confused when there is an explicit ordering 
directive (ASC/DESC):

{code}
scala> sqlContext.sql("select cnt from test order by (cnt + cnt) asc")
15/08/25 04:08:02 INFO ParseDriver: Parsing command: select cnt from test order 
by (cnt + cnt) asc
org.apache.spark.sql.AnalysisException: extraneous input 'asc' expecting EOF 
near 'EOF'; line 1 pos 40
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
...
{code}

  was:
Spark SQL supports expressions in ORDER BY clauses, e.g.,

{code}
scala> sqlContext.sql("select cnt from test order by (cnt + cnt)")
res2: org.apache.spark.sql.DataFrame = [cnt: bigint]
{code}

However, the analyzer gets confused when there is an explicit ordering 
directive (ASC/DESC):

{code}
scala> sqlContext.sql("select * from cats order by (cnt + cnt) asc")
15/08/25 04:08:02 INFO ParseDriver: Parsing command: select cnt from test order 
by (cnt + cnt) asc
org.apache.spark.sql.AnalysisException: extraneous input 'asc' expecting EOF 
near 'EOF'; line 1 pos 40
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
...
{code}


 Spark SQL cannot handle ordering directive in ORDER BY clauses with 
 expressions
 ---

 Key: SPARK-10217
 URL: https://issues.apache.org/jira/browse/SPARK-10217
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: SQL, analyzers

 Spark SQL supports expressions in ORDER BY clauses, e.g.,
 {code}
 scala> sqlContext.sql("select cnt from test order by (cnt + cnt)")
 res2: org.apache.spark.sql.DataFrame = [cnt: bigint]
 {code}
 However, the analyzer gets confused when there is an explicit ordering 
 directive (ASC/DESC):
 {code}
 scala> sqlContext.sql("select cnt from test order by (cnt + cnt) asc")
 15/08/25 04:08:02 INFO ParseDriver: Parsing command: select cnt from test 
 order by (cnt + cnt) asc
 org.apache.spark.sql.AnalysisException: extraneous input 'asc' expecting EOF 
 near 'EOF'; line 1 pos 40
   at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289)
   at 
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
   at 
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
 ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9813) Incorrect UNION ALL behavior

2015-08-12 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14693512#comment-14693512
 ] 

Simeon Simeonov edited comment on SPARK-9813 at 8/12/15 1:59 PM:
-

[~hvanhovell] that's a great, tight demonstration of the problem.

I'd argue that making UNION ALL work well is very important because this is the 
main tool one uses in SQL to deal with incompatible schema changes that require 
uniform querying. It's typically done via CREATE VIEW of a UNION ALL of N 
SELECT statements that produce the desired schema. It's one of the go-to 
strategies for managing change in large-scale real-world deployments.

Unfortunately, views don't work in Spark SQL right now 
([https://issues.apache.org/jira/browse/SPARK-9342]) partly due to serious 
metastore management problems 
([https://issues.apache.org/jira/browse/SPARK-9764]). However, views are, 
ultimately, sugar. If we get UNION ALL working correctly a Spark SQL user can 
always just copy the UNION ALL expression where they would have preferred to 
use a view.


was (Author: simeons):
[~hvanhovell] that's a great, tight demonstration of the problem.

I'd argue that making UNION ALL work well is very important because this is the 
main tool one uses in SQL to deal with incompatible schema changes that require 
uniform querying. It's typically done via CREATE VIEW of a UNION ALL of N 
SELECT statements that produce the desired schema. It's one of the go-to 
strategies for managing change in large-scale real-world deployments.

Unfortunately, views don't work in Spark SQL right now 
[https://issues.apache.org/jira/browse/SPARK-9342] partly due to serious 
metastore management problems 
[https://issues.apache.org/jira/browse/SPARK-9764]. However, views are, 
ultimately, sugar. If we get UNION ALL working correctly a Spark SQL user can 
always just copy the UNION ALL expression where they would have preferred to 
use a view.

 Incorrect UNION ALL behavior
 

 Key: SPARK-9813
 URL: https://issues.apache.org/jira/browse/SPARK-9813
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql, union

 According to the [Hive Language 
 Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
 for UNION ALL:
 {quote}
 The number and names of columns returned by each select_statement have to be 
 the same. Otherwise, a schema error is thrown.
 {quote}
 Spark SQL silently swallows an error when the tables being joined with UNION 
 ALL have the same number of columns but different names.
 Reproducible example:
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "category" vs. "cat" names of first column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"cat" : "A", "num" : 5}""")
 //  +--------+---+
 //  |category|num|
 //  +--------+---+
 //  |       A|  5|
 //  |       A|  5|
 //  +--------+---+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 When the number of columns is different, Spark can even mix in datatypes. 
 Reproducible example (requires a new spark-shell session):
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "test_another" is missing "category" column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"num" : 5}""")
 //  +--------+
 //  |category|
 //  +--------+
 //  |       A|
 //  |       5|
 //  +--------+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new 

[jira] [Comment Edited] (SPARK-9813) Incorrect UNION ALL behavior

2015-08-12 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14693512#comment-14693512
 ] 

Simeon Simeonov edited comment on SPARK-9813 at 8/12/15 2:00 PM:
-

[~hvanhovell] that's a great, tight demonstration of the problem.

I'd argue that making UNION ALL work well is very important because this is the 
main tool one uses in SQL to deal with incompatible schema changes that require 
uniform querying. It's typically done via loading incompatible data in separate 
tables and then using a CREATE VIEW of a UNION ALL of N SELECT statements that 
produce the desired schema. It's one of the go-to strategies for managing 
change in large-scale real-world deployments.

Unfortunately, views don't work in Spark SQL right now 
([https://issues.apache.org/jira/browse/SPARK-9342]) partly due to serious 
metastore management problems 
([https://issues.apache.org/jira/browse/SPARK-9764]). However, views are, 
ultimately, sugar. If we get UNION ALL working correctly a Spark SQL user can 
always just copy the UNION ALL expression where they would have preferred to 
use a view.


was (Author: simeons):
[~hvanhovell] that's a great, tight demonstration of the problem.

I'd argue that making UNION ALL work well is very important because this is the 
main tool one uses in SQL to deal with incompatible schema changes that require 
uniform querying. It's typically done via CREATE VIEW of a UNION ALL of N 
SELECT statements that produce the desired schema. It's one of the go-to 
strategies for managing change in large-scale real-world deployments.

Unfortunately, views don't work in Spark SQL right now 
([https://issues.apache.org/jira/browse/SPARK-9342]) partly due to serious 
metastore management problems 
([https://issues.apache.org/jira/browse/SPARK-9764]). However, views are, 
ultimately, sugar. If we get UNION ALL working correctly a Spark SQL user can 
always just copy the UNION ALL expression where they would have preferred to 
use a view.

 Incorrect UNION ALL behavior
 

 Key: SPARK-9813
 URL: https://issues.apache.org/jira/browse/SPARK-9813
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql, union

 According to the [Hive Language 
 Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
 for UNION ALL:
 {quote}
 The number and names of columns returned by each select_statement have to be 
 the same. Otherwise, a schema error is thrown.
 {quote}
 Spark SQL silently swallows an error when the tables being joined with UNION 
 ALL have the same number of columns but different names.
 Reproducible example:
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "category" vs. "cat" names of first column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"cat" : "A", "num" : 5}""")
 //  +--------+---+
 //  |category|num|
 //  +--------+---+
 //  |       A|  5|
 //  |       A|  5|
 //  +--------+---+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 When the number of columns is different, Spark can even mix in datatypes. 
 Reproducible example (requires a new spark-shell session):
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "test_another" is missing "category" column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"num" : 5}""")
 //  +--------+
 //  |category|
 //  +--------+
 //  |       A|
 //  |       5|
 //  +--------+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // 

[jira] [Commented] (SPARK-9813) Incorrect UNION ALL behavior

2015-08-12 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14693512#comment-14693512
 ] 

Simeon Simeonov commented on SPARK-9813:


[~hvanhovell] that's a great, tight demonstration of the problem.

I'd argue that making UNION ALL work well is very important because this is the 
main tool one uses in SQL to deal with incompatible schema changes that require 
uniform querying. It's typically done via CREATE VIEW of a UNION ALL of N 
SELECT statements that produce the desired schema. It's one of the go-to 
strategies for managing change in large-scale real-world deployments.

Unfortunately, views don't work in Spark SQL right now 
[https://issues.apache.org/jira/browse/SPARK-9342] partly due to serious 
metastore management problems 
[https://issues.apache.org/jira/browse/SPARK-9764]. However, views are, 
ultimately, sugar. If we get UNION ALL working correctly a Spark SQL user can 
always just copy the UNION ALL expression where they would have preferred to 
use a view.
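
Until the analyzer enforces this, the check can at least be made explicit on the 
DataFrame side before the union; a small sketch (a guard, not a fix for the 
analyzer itself):

{code}
import org.apache.spark.sql.DataFrame

// Sketch: refuse to union DataFrames whose schemas (names and types) differ,
// instead of letting the mismatch be silently swallowed.
def strictUnionAll(a: DataFrame, b: DataFrame): DataFrame = {
  require(a.schema == b.schema,
    s"UNION ALL schema mismatch:\n${a.schema.treeString}\n${b.schema.treeString}")
  a.unionAll(b)
}
{code}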

 Incorrect UNION ALL behavior
 

 Key: SPARK-9813
 URL: https://issues.apache.org/jira/browse/SPARK-9813
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql, union

 According to the [Hive Language 
 Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
 for UNION ALL:
 {quote}
 The number and names of columns returned by each select_statement have to be 
 the same. Otherwise, a schema error is thrown.
 {quote}
 Spark SQL silently swallows an error when the tables being joined with UNION 
 ALL have the same number of columns but different names.
 Reproducible example:
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "category" vs. "cat" names of first column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"cat" : "A", "num" : 5}""")
 //  +--------+---+
 //  |category|num|
 //  +--------+---+
 //  |       A|  5|
 //  |       A|  5|
 //  +--------+---+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 When the number of columns is different, Spark can even mix in datatypes. 
 Reproducible example (requires a new spark-shell session):
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "test_another" is missing "category" column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"num" : 5}""")
 //  +--------+
 //  |category|
 //  +--------+
 //  |       A|
 //  |       5|
 //  +--------+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 At other times, when the schema are complex, Spark SQL produces a misleading 
 error about an unresolved Union operator:
 {code}
 scala> ctx.sql("""select * from view_clicks
  | union all
  | select * from view_clicks_aug
  | """)
 15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
 union all
 select * from view_clicks_aug
 15/08/11 02:40:25 INFO ParseDriver: Parse Completed
 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=view_clicks
 15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
 cmd=get_table : db=default tbl=view_clicks
 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=view_clicks
 15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
 cmd=get_table : db=default tbl=view_clicks
 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=view_clicks_aug
 

[jira] [Commented] (SPARK-9345) Failure to cleanup on exceptions causes persistent I/O problems later on

2015-08-12 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14693537#comment-14693537
 ] 

Simeon Simeonov commented on SPARK-9345:


[~marmbrus] Yes, Michael: {{kill -9}} is the way some of the time. However, 
there are types of OOM exceptions that keep {{spark-shell}} running but create 
side effects. One example I've discovered recently is temporary folders in the 
Hive managed table space in HDFS not getting cleaned up, which causes exceptions 
when, say, {{saveAsTable}} with the same table name runs later.
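
For reference, a stale temporary directory can be removed by hand with the Hadoop 
FileSystem API before retrying {{saveAsTable}}; a hedged sketch (the warehouse 
path below is an assumption and depends on {{hive.metastore.warehouse.dir}}):

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical cleanup of a _temporary directory left behind by a failed job.
val fs = FileSystem.get(sc.hadoopConfiguration)
val stale = new Path("/user/hive/warehouse/test/_temporary")
if (fs.exists(stale)) fs.delete(stale, true) // recursive delete
{code}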

 Failure to cleanup on exceptions causes persistent I/O problems later on
 

 Key: SPARK-9345
 URL: https://issues.apache.org/jira/browse/SPARK-9345
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 1.3.1, 1.4.0, 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
Priority: Minor

 When using spark-shell in local mode, I've observed the following behavior on 
 a number of nodes:
 # Some operation generates an exception related to Spark SQL processing via 
 {{HiveContext}}.
 # From that point on, nothing could be written to Hive with {{saveAsTable}}.
 # Another identically-configured version of Spark on the same machine may not 
 exhibit the problem initially but, with enough exceptions, it will start 
 exhibiting the problem also.
 # A new identically-configured installation of the same version on the same 
 machine would exhibit the problem.
 The error is always related to inability to create a temporary folder on HDFS:
 {code}
 15/07/25 16:03:35 ERROR InsertIntoHadoopFsRelation: Aborting task.
 java.io.IOException: Mkdirs failed to create 
 file:/user/hive/warehouse/test/_temporary/0/_temporary/attempt_201507251603_0001_m_01_0
  (exists=false, cwd=file:/home/ubuntu)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
   at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
   at 
 org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
   at 
 org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
   at 
 org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
   at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 ...
 {code}
 The behavior does not seem related to HDFS as it persists even if the HDFS 
 volume is reformatted. 
 The behavior is difficult to reproduce reliably but consistently observable 
 with sufficient Spark SQL experimentation (dozens of exceptions arising from 
 Spark SQL processing). 
 The likelihood of this happening goes up substantially if some Spark SQL 
 operation runs out of memory, which suggests
 that the problem is related to cleanup.
 In this gist ([https://gist.github.com/ssimeonov/72a64947bc33628d2d11]) you 
 can see how on the same machine, identically configured 1.3.1 and 1.4.1 
 versions sharing the same HDFS system and Hive metastore, behave differently. 
 1.3.1 can write to Hive. 1.4.1 cannot. The behavior started happening on 
 1.4.1 after an out of memory exception on a large job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9813) Incorrect UNION ALL behavior

2015-08-11 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681326#comment-14681326
 ] 

Simeon Simeonov edited comment on SPARK-9813 at 8/11/15 6:47 AM:
-

[~hvanhovell] Oracle requires the number of columns to be the same and the data 
types to be compatible. (See 
http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we 
take that approach with Spark, then:


- The first case would be OK (but different from Hive, which will cause its own 
set of problems as there is essentially no documentation on Spark SQL so 
everyone goes to the Hive Language Manual)

- The second case would still be a bug because (a) the number of columns are 
different and (b) a numeric column is mixed into a string column

- The third case still produces an opaque and confusing exception.


was (Author: simeons):
[~hvanhovell] Oracle requires the number of columns to be the same and the data 
types to be compatible. (See 
http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we 
take that approach with Spark, then:


- The first case would be OK (but different from Hive, which will cause its own 
set of problems as there is essentially no documentation on Spark SQL so 
everyone goes to the Hive Language Manual)

- The second case would still be a bug because (a) the number of columns were 
different and (b) a numeric column was mixed into a string column

- The third case still produces an opaque and confusing exception.

 Incorrect UNION ALL behavior
 

 Key: SPARK-9813
 URL: https://issues.apache.org/jira/browse/SPARK-9813
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql, union

 According to the [Hive Language 
 Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
 for UNION ALL:
 {quote}
 The number and names of columns returned by each select_statement have to be 
 the same. Otherwise, a schema error is thrown.
 {quote}
 Spark SQL silently swallows an error when the tables being joined with UNION 
 ALL have the same number of columns but different names.
 Reproducible example:
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "category" vs. "cat" names of first column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"cat" : "A", "num" : 5}""")
 //  +--------+---+
 //  |category|num|
 //  +--------+---+
 //  |       A|  5|
 //  |       A|  5|
 //  +--------+---+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 When the number of columns is different, Spark can even mix in datatypes. 
 Reproducible example (requires a new spark-shell session):
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "test_another" is missing "category" column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"num" : 5}""")
 //  +--------+
 //  |category|
 //  +--------+
 //  |       A|
 //  |       5|
 //  +--------+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 At other times, when the schema are complex, Spark SQL produces a misleading 
 error about an unresolved Union operator:
 {code}
 scala> ctx.sql("""select * from view_clicks
  | union all
  | select * from view_clicks_aug
  | """)
 15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
 union all
 select * from view_clicks_aug
 15/08/11 02:40:25 INFO ParseDriver: Parse 

[jira] [Commented] (SPARK-9813) Incorrect UNION ALL behavior

2015-08-11 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681326#comment-14681326
 ] 

Simeon Simeonov commented on SPARK-9813:


[~hvanhovell] Oracle requires the number of columns to be the same and the data 
types to be compatible. (See 
http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we 
take that approach with Spark, then:


- The first case would be OK (but different from Hive, which will cause it's 
own set of problems as there is essentially no documentation on Spark SQL so 
everyone goes to the Hive Language Manual)

- The second case would still be a bug because (a) the number of columns were 
different and (b) a numeric column was mixed into a string column

- The third case still produces an opaque and confusing exception.

 Incorrect UNION ALL behavior
 

 Key: SPARK-9813
 URL: https://issues.apache.org/jira/browse/SPARK-9813
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql, union

 According to the [Hive Language 
 Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
 for UNION ALL:
 {quote}
 The number and names of columns returned by each select_statement have to be 
 the same. Otherwise, a schema error is thrown.
 {quote}
 Spark SQL silently swallows an error when the tables being joined with UNION 
 ALL have the same number of columns but different names.
 Reproducible example:
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "category" vs. "cat" names of first column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"cat" : "A", "num" : 5}""")
 //  +--------+---+
 //  |category|num|
 //  +--------+---+
 //  |       A|  5|
 //  |       A|  5|
 //  +--------+---+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 When the number of columns is different, Spark can even mix in datatypes. 
 Reproducible example (requires a new spark-shell session):
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note "test_another" is missing "category" column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"num" : 5}""")
 //  +--------+
 //  |category|
 //  +--------+
 //  |       A|
 //  |       5|
 //  +--------+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 At other times, when the schema are complex, Spark SQL produces a misleading 
 error about an unresolved Union operator:
 {code}
 scala> ctx.sql("""select * from view_clicks
  | union all
  | select * from view_clicks_aug
  | """)
 15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
 union all
 select * from view_clicks_aug
 15/08/11 02:40:25 INFO ParseDriver: Parse Completed
 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=view_clicks
 15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
 cmd=get_table : db=default tbl=view_clicks
 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=view_clicks
 15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
 cmd=get_table : db=default tbl=view_clicks
 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=view_clicks_aug
 15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
 cmd=get_table : db=default tbl=view_clicks_aug
 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
 tbl=view_clicks_aug
 15/08/11 

[jira] [Comment Edited] (SPARK-9813) Incorrect UNION ALL behavior

2015-08-11 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681326#comment-14681326
 ] 

Simeon Simeonov edited comment on SPARK-9813 at 8/11/15 6:46 AM:
-

[~hvanhovell] Oracle requires the number of columns to be the same and the data 
types to be compatible. (See 
http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we 
take that approach with Spark, then:


- The first case would be OK (but different from Hive, which will cause its own 
set of problems as there is essentially no documentation on Spark SQL so 
everyone goes to the Hive Language Manual)

- The second case would still be a bug because (a) the number of columns were 
different and (b) a numeric column was mixed into a string column

- The third case still produces an opaque and confusing exception.


was (Author: simeons):
[~hvanhovell] Oracle requires the number of columns to be the same and the data 
types to be compatible. (See 
http://docs.oracle.com/cd/B19306_01/server.102/b14200/queries004.htm) If we 
take that approach with Spark, then:


- The first case would be OK (but different from Hive, which will cause it's 
own set of problems as there is essentially no documentation on Spark SQL so 
everyone goes to the Hive Language Manual)

- The second case would still be a bug because (a) the number of columns were 
different and (b) a numeric column was mixed into a string column

- The third case still produces an opaque and confusing exception.

 Incorrect UNION ALL behavior
 

 Key: SPARK-9813
 URL: https://issues.apache.org/jira/browse/SPARK-9813
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql, union

 According to the [Hive Language 
 Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
 for UNION ALL:
 {quote}
 The number and names of columns returned by each select_statement have to be 
 the same. Otherwise, a schema error is thrown.
 {quote}
 Spark SQL silently swallows an error when the tables being joined with UNION 
 ALL have the same number of columns but different names.
 Reproducible example:
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json("file://" + path).registerTempTable(name)
 }
 // Note category vs. cat names of first column
 tempTable("test_one", """{"category" : "A", "num" : 5}""")
 tempTable("test_another", """{"cat" : "A", "num" : 5}""")
 //  +--------+---+
 //  |category|num|
 //  +--------+---+
 //  |       A|  5|
 //  |       A|  5|
 //  +--------+---+
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql("select * from test_one union all select * from test_another").show
 // Cleanup
 new File(dataPath("test_one")).delete()
 new File(dataPath("test_another")).delete()
 {code}
 When the number of columns is different, Spark can even mix in datatypes. 
 Reproducible example (requires a new spark-shell session):
 {code}
 // This test is meant to run in spark-shell
 import java.io.File
 import java.io.PrintWriter
 import org.apache.spark.sql.hive.HiveContext
 import org.apache.spark.sql.SaveMode
 val ctx = sqlContext.asInstanceOf[HiveContext]
 import ctx.implicits._
 def dataPath(name:String) = sys.env(HOME) + / + name + .jsonlines
 def tempTable(name: String, json: String) = {
   val path = dataPath(name)
   new PrintWriter(path) { write(json); close }
   ctx.read.json(file:// + path).registerTempTable(name)
 }
 // Note test_another is missing category column
 tempTable(test_one, {category : A, num : 5})
 tempTable(test_another, {num : 5})
 //  ++
 //  |category|
 //  ++
 //  |   A|
 //  |   5| 
 //  ++
 //
 //  Instead, an error should have been generated due to incompatible schema
 ctx.sql(select * from test_one union all select * from test_another).show
 // Cleanup
 new File(dataPath(test_one)).delete()
 new File(dataPath(test_another)).delete()
 {code}
 At other times, when the schemas are complex, Spark SQL produces a misleading 
 error about an unresolved Union operator:
 {code}
 scala> ctx.sql("""select * from view_clicks
      | union all
      | select * from view_clicks_aug
      | """)
 15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
 union all
 select * from view_clicks_aug
 15/08/11 02:40:25 INFO ParseDriver: Parse 

[jira] [Created] (SPARK-9813) Incorrect UNION ALL behavior

2015-08-10 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-9813:
--

 Summary: Incorrect UNION ALL behavior
 Key: SPARK-9813
 URL: https://issues.apache.org/jira/browse/SPARK-9813
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov


According to the [Hive Language 
Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
for UNION ALL:

{quote}
The number and names of columns returned by each select_statement have to be 
the same. Otherwise, a schema error is thrown.
{quote}

Spark SQL silently swallows an error when the tables being joined with UNION 
ALL have the same number of columns but different names.

Reproducible example:

{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note category vs. cat names of first column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"cat" : "A", "num" : 5}""")

//  +--------+---+
//  |category|num|
//  +--------+---+
//  |       A|  5|
//  |       A|  5|
//  +--------+---+
//
//  Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
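
Until the analyzer enforces this, a defensive workaround sketch (assuming the intended shared schema is known; it reuses {{ctx}} and the column names from the example above) is to project both sides onto an explicit column list, with aliases and casts, before the union:

{code}
// Workaround sketch: align the two sides on an explicit, shared column list
// instead of relying on select *.
val aligned = ctx.sql("""
  select category, cast(num as bigint) as num from test_one
  union all
  select cat as category, cast(num as bigint) as num from test_another
""")
aligned.show()
{code}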



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9813) Incorrect UNION ALL behavior

2015-08-10 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-9813:
---
Description: 
According to the [Hive Language 
Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] 
for UNION ALL:

{quote}
The number and names of columns returned by each select_statement have to be 
the same. Otherwise, a schema error is thrown.
{quote}

Spark SQL silently swallows an error when the tables being joined with UNION 
ALL have the same number of columns but different names.

Reproducible example:

{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note category vs. cat names of first column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"cat" : "A", "num" : 5}""")

//  +--------+---+
//  |category|num|
//  +--------+---+
//  |       A|  5|
//  |       A|  5|
//  +--------+---+
//
//  Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}

When the number of columns is different, Spark can even mix in datatypes. 

Reproducible example (requires a new spark-shell session):

{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note test_another is missing category column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"num" : 5}""")

//  +--------+
//  |category|
//  +--------+
//  |       A|
//  |       5|
//  +--------+
//
//  Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}

At other times, when the schemas are complex, Spark SQL produces a misleading 
error about an unresolved Union operator:

{code}
scala> ctx.sql("""select * from view_clicks
     | union all
     | select * from view_clicks_aug
     | """)
15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
union all
select * from view_clicks_aug
15/08/11 02:40:25 INFO ParseDriver: Parse Completed
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
cmd=get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default 
tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu  ip=unknown-ip-addr  
cmd=get_table : db=default tbl=view_clicks_aug
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:126)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
at 

[jira] [Commented] (SPARK-9625) SparkILoop creates sql context continuously, thousands of times

2015-08-08 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14663233#comment-14663233
 ] 

Simeon Simeonov commented on SPARK-9625:


[~sowen] I can reproduce this problem at will but only with a specific 
combination of code/data. I tried several times w/o success to create a small 
standalone reproducible example. The same code works w/o issues inside a 
spark-submit script which is why I think the problem has something to do with 
closure handling related to spark-shell.

 SparkILoop creates sql context continuously, thousands of times
 ---

 Key: SPARK-9625
 URL: https://issues.apache.org/jira/browse/SPARK-9625
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql

 Occasionally but repeatably, based on the Spark SQL operations being run, 
 {{spark-shell}} gets into a funk where it attempts to create a sql context 
 over and over again as it is doing its work. Example output below:
 {code}
 15/08/05 03:04:12 INFO DAGScheduler: looking for newly runnable stages
 15/08/05 03:04:12 INFO DAGScheduler: running: Set()
 15/08/05 03:04:12 INFO DAGScheduler: waiting: Set(ShuffleMapStage 7, 
 ResultStage 8)
 15/08/05 03:04:12 INFO DAGScheduler: failed: Set()
 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ShuffleMapStage 7: 
 List()
 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ResultStage 8: 
 List(ShuffleMapStage 7)
 15/08/05 03:04:12 INFO DAGScheduler: Submitting ShuffleMapStage 7 
 (MapPartitionsRDD[49] at map at <console>:474), which is now runnable
 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(47840) called with 
 curMem=685306, maxMem=26671746908
 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12 stored as values in 
 memory (estimated size 46.7 KB, free 24.8 GB)
 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(15053) called with 
 curMem=733146, maxMem=26671746908
 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes 
 in memory (estimated size 14.7 KB, free 24.8 GB)
 15/08/05 03:04:12 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory 
 on localhost:39451 (size: 14.7 KB, free: 24.8 GB)
 15/08/05 03:04:12 INFO SparkContext: Created broadcast 12 from broadcast at 
 DAGScheduler.scala:874
 15/08/05 03:04:12 INFO DAGScheduler: Submitting 1 missing tasks from 
 ShuffleMapStage 7 (MapPartitionsRDD[49] at map at <console>:474)
 15/08/05 03:04:12 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
 15/08/05 03:04:12 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 
 684, localhost, PROCESS_LOCAL, 1461 bytes)
 15/08/05 03:04:12 INFO Executor: Running task 0.0 in stage 7.0 (TID 684)
 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Getting 214 non-empty 
 blocks out of 214 blocks
 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches 
 in 1 ms
 15/08/05 03:04:12 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO HiveMetaStore: No user is added in admin role, since 
 config is empty
 15/08/05 03:04:13 INFO SessionState: No Tez session required at this point. 
 hive.execution.engine=mr.
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO 

[jira] [Created] (SPARK-9761) Inconsistent metadata handling with ALTER TABLE

2015-08-08 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-9761:
--

 Summary: Inconsistent metadata handling with ALTER TABLE
 Key: SPARK-9761
 URL: https://issues.apache.org/jira/browse/SPARK-9761
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov


Schema changes made with {{ALTER TABLE}} are not shown in {{DESCRIBE TABLE}}. 
The table in question was created with {{HiveContext.read.json()}}.

Steps:

# {{alter table dimension_components add columns (z string);}} succeeds.
# {{describe dimension_components;}} does not show the new column, even after 
restarting spark-sql.
# A second {{alter table dimension_components add columns (z string);}} fails 
with ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: 
Duplicate column name: z

Full spark-sql output 
[here|https://gist.github.com/ssimeonov/d9af4b8bb76b9d7befde].
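
To help narrow down where the metadata diverges, a quick check from a spark-shell session (sketch only; the table name is the one from the steps above) is to compare the schema the catalog hands to the DataFrame API with the output of {{DESCRIBE}}:

{code}
// Sketch: compare the catalog's view of the table with DESCRIBE output.
import org.apache.spark.sql.hive.HiveContext

val ctx = sqlContext.asInstanceOf[HiveContext]
ctx.table("dimension_components").printSchema()    // schema as seen by the DataFrame API
ctx.sql("describe dimension_components").show(100) // schema as reported by DESCRIBE
{code}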



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9762) ALTER TABLE cannot find column

2015-08-08 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-9762:
--

 Summary: ALTER TABLE cannot find column
 Key: SPARK-9762
 URL: https://issues.apache.org/jira/browse/SPARK-9762
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov


{{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} lists. 

In the case of a table generated with {{HiveContext.read.json()}}, the output 
of {{DESCRIBE dimension_components}} is:

{code}
comp_config struct<adText:string,adTextLeft:string,background:string,brand:string,button_color:string,cta_side:string,cta_type:string,depth:string,fixed_under:string,light:string,mid_text:string,oneline:string,overhang:string,shine:string,style:string,style_secondary:string,style_small:string,type:string>
comp_criteria   string
comp_data_model string
comp_dimensions struct<data:string,integrations:array<string>,template:string,variation:bigint>
comp_disabled   boolean
comp_id bigint
comp_path   string
comp_placementData  struct<mod:string>
comp_slot_types array<string>
{code}

However, {{alter table dimension_components change comp_dimensions 
comp_dimensions 
struct<data:string,integrations:array<string>,template:string,variation:bigint,z:string>;}}
 fails with:

{code}
15/08/08 23:13:07 ERROR exec.DDLTask: 
org.apache.hadoop.hive.ql.metadata.HiveException: Invalid column reference 
comp_dimensions
at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3584)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:312)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:345)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:326)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:155)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:326)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:316)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:473)
...
{code}

Full spark-sql output 
[here|https://gist.github.com/ssimeonov/636a25d6074a03aafa67].




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9762) ALTER TABLE cannot find column

2015-08-08 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-9762:
---
Description: 
{{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} lists. 

In the case of a table generated with {{HiveContext.read.json()}}, the output 
of {{DESCRIBE dimension_components}} is:

{code}
comp_config struct<adText:string,adTextLeft:string,background:string,brand:string,button_color:string,cta_side:string,cta_type:string,depth:string,fixed_under:string,light:string,mid_text:string,oneline:string,overhang:string,shine:string,style:string,style_secondary:string,style_small:string,type:string>
comp_criteria   string
comp_data_model string
comp_dimensions struct<data:string,integrations:array<string>,template:string,variation:bigint>
comp_disabled   boolean
comp_id bigint
comp_path   string
comp_placementData  struct<mod:string>
comp_slot_types array<string>
{code}

However, {{alter table dimension_components change comp_dimensions 
comp_dimensions 
struct<data:string,integrations:array<string>,template:string,variation:bigint,z:string>;}}
 fails with:

{code}
15/08/08 23:13:07 ERROR exec.DDLTask: 
org.apache.hadoop.hive.ql.metadata.HiveException: Invalid column reference 
comp_dimensions
at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3584)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:312)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:345)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:326)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:155)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:326)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:316)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:473)
...
{code}

Meanwhile, {{SHOW COLUMNS in dimension_components}} lists two columns: {{col}} 
(which does not exist in the table) and {{z}}, which was just added.

This suggests that DDL operations in Spark SQL use table metadata 
inconsistently.

Full spark-sql output 
[here|https://gist.github.com/ssimeonov/636a25d6074a03aafa67].


  was:
{{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} lists. 

In the case of a table generated with {{HiveContext.read.json()}}, the output 
of {{DESCRIBE dimension_components}} is:

{code}
comp_config struct<adText:string,adTextLeft:string,background:string,brand:string,button_color:string,cta_side:string,cta_type:string,depth:string,fixed_under:string,light:string,mid_text:string,oneline:string,overhang:string,shine:string,style:string,style_secondary:string,style_small:string,type:string>
comp_criteria   string
comp_data_model string
comp_dimensions struct<data:string,integrations:array<string>,template:string,variation:bigint>
comp_disabled   boolean
comp_id bigint
comp_path   string
comp_placementData  struct<mod:string>
comp_slot_types array<string>
{code}

However, {{alter table dimension_components change comp_dimensions 
comp_dimensions 
struct<data:string,integrations:array<string>,template:string,variation:bigint,z:string>;}}
 fails with:

{code}
15/08/08 23:13:07 ERROR exec.DDLTask: 
org.apache.hadoop.hive.ql.metadata.HiveException: Invalid column reference 
comp_dimensions
at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3584)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:312)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:345)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:326)
at 

[jira] [Created] (SPARK-9764) Spark SQL uses table metadata inconsistently

2015-08-08 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-9764:
--

 Summary: Spark SQL uses table metadata inconsistently
 Key: SPARK-9764
 URL: https://issues.apache.org/jira/browse/SPARK-9764
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov


For the same table, {{DESCRIBE}} and {{SHOW COLUMNS}} produce different 
results. The former shows the correct column names. The latter always shows 
just a single column named {{col}}. This is true for any table created with 
{{HiveContext.read.json}}.
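
A minimal reproduction sketch (assumes a spark-shell session; the file path and table name below are made up for illustration):

{code}
// Sketch: build a table from a small JSON file, then compare the two commands.
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext

val ctx = sqlContext.asInstanceOf[HiveContext]
val path = "/tmp/spark9764.jsonlines"
new PrintWriter(path) { write("""{"a" : 1, "b" : "x"}"""); close }
ctx.read.json("file://" + path).write.saveAsTable("spark9764_test")

ctx.sql("describe spark9764_test").show()        // shows the real columns a and b
ctx.sql("show columns in spark9764_test").show() // shows a single column named col
{code}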



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9625) SparkILoop creates sql context continuously, thousands of times

2015-08-04 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov closed SPARK-9625.
--
Resolution: Won't Fix

 SparkILoop creates sql context continuously, thousands of times
 ---

 Key: SPARK-9625
 URL: https://issues.apache.org/jira/browse/SPARK-9625
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql

 Occasionally but repeatably, based on the Spark SQL operations being run, 
 {{spark-shell}} gets into a funk where it attempts to create a sql context 
 over and over again as it is doing its work. Example output below:
 {code}
 15/08/05 03:04:12 INFO DAGScheduler: looking for newly runnable stages
 15/08/05 03:04:12 INFO DAGScheduler: running: Set()
 15/08/05 03:04:12 INFO DAGScheduler: waiting: Set(ShuffleMapStage 7, 
 ResultStage 8)
 15/08/05 03:04:12 INFO DAGScheduler: failed: Set()
 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ShuffleMapStage 7: 
 List()
 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ResultStage 8: 
 List(ShuffleMapStage 7)
 15/08/05 03:04:12 INFO DAGScheduler: Submitting ShuffleMapStage 7 
 (MapPartitionsRDD[49] at map at <console>:474), which is now runnable
 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(47840) called with 
 curMem=685306, maxMem=26671746908
 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12 stored as values in 
 memory (estimated size 46.7 KB, free 24.8 GB)
 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(15053) called with 
 curMem=733146, maxMem=26671746908
 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes 
 in memory (estimated size 14.7 KB, free 24.8 GB)
 15/08/05 03:04:12 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory 
 on localhost:39451 (size: 14.7 KB, free: 24.8 GB)
 15/08/05 03:04:12 INFO SparkContext: Created broadcast 12 from broadcast at 
 DAGScheduler.scala:874
 15/08/05 03:04:12 INFO DAGScheduler: Submitting 1 missing tasks from 
 ShuffleMapStage 7 (MapPartitionsRDD[49] at map at <console>:474)
 15/08/05 03:04:12 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
 15/08/05 03:04:12 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 
 684, localhost, PROCESS_LOCAL, 1461 bytes)
 15/08/05 03:04:12 INFO Executor: Running task 0.0 in stage 7.0 (TID 684)
 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Getting 214 non-empty 
 blocks out of 214 blocks
 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches 
 in 1 ms
 15/08/05 03:04:12 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO HiveMetaStore: No user is added in admin role, since 
 config is empty
 15/08/05 03:04:13 INFO SessionState: No Tez session required at this point. 
 hive.execution.engine=mr.
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 {code}
 In the 

[jira] [Commented] (SPARK-9625) SparkILoop creates sql context continuously, thousands of times

2015-08-04 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654806#comment-14654806
 ] 

Simeon Simeonov commented on SPARK-9625:


This is a spark-shell-specific issue, seemingly related to closures and 
serialization. All code entered in the shell is wrapped in a closure, and that can 
affect serialization. 
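
A toy sketch of the mechanism (unrelated to the actual failing job; the names are made up): in the REPL every line lives inside a generated wrapper object, so referencing a shell-defined val from a task closure can pull that wrapper, and whatever else it holds, into the serialized closure. Copying the value into a method-local val first keeps the captured state minimal:

{code}
// Illustration only: how REPL wrappers can leak into task closures.
val threshold = 10                 // lives in a REPL wrapper object
val rdd = sc.parallelize(1 to 100)

// The filter closure references the wrapper that holds `threshold`.
val viaWrapper = rdd.filter(_ > threshold).count()

// Copying into a local val means the closure captures only an Int.
def countAbove(): Long = {
  val t = threshold
  rdd.filter(_ > t).count()
}
{code}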

 SparkILoop creates sql context continuously, thousands of times
 ---

 Key: SPARK-9625
 URL: https://issues.apache.org/jira/browse/SPARK-9625
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.4.1
 Environment: Ubuntu on AWS
Reporter: Simeon Simeonov
  Labels: sql

 Occasionally but repeatably, based on the Spark SQL operations being run, 
 {{spark-shell}} gets into a funk where it attempts to create a sql context 
 over and over again as it is doing its work. Example output below:
 {code}
 15/08/05 03:04:12 INFO DAGScheduler: looking for newly runnable stages
 15/08/05 03:04:12 INFO DAGScheduler: running: Set()
 15/08/05 03:04:12 INFO DAGScheduler: waiting: Set(ShuffleMapStage 7, 
 ResultStage 8)
 15/08/05 03:04:12 INFO DAGScheduler: failed: Set()
 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ShuffleMapStage 7: 
 List()
 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ResultStage 8: 
 List(ShuffleMapStage 7)
 15/08/05 03:04:12 INFO DAGScheduler: Submitting ShuffleMapStage 7 
 (MapPartitionsRDD[49] at map at <console>:474), which is now runnable
 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(47840) called with 
 curMem=685306, maxMem=26671746908
 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12 stored as values in 
 memory (estimated size 46.7 KB, free 24.8 GB)
 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(15053) called with 
 curMem=733146, maxMem=26671746908
 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes 
 in memory (estimated size 14.7 KB, free 24.8 GB)
 15/08/05 03:04:12 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory 
 on localhost:39451 (size: 14.7 KB, free: 24.8 GB)
 15/08/05 03:04:12 INFO SparkContext: Created broadcast 12 from broadcast at 
 DAGScheduler.scala:874
 15/08/05 03:04:12 INFO DAGScheduler: Submitting 1 missing tasks from 
 ShuffleMapStage 7 (MapPartitionsRDD[49] at map at <console>:474)
 15/08/05 03:04:12 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
 15/08/05 03:04:12 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 
 684, localhost, PROCESS_LOCAL, 1461 bytes)
 15/08/05 03:04:12 INFO Executor: Running task 0.0 in stage 7.0 (TID 684)
 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Getting 214 non-empty 
 blocks out of 214 blocks
 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches 
 in 1 ms
 15/08/05 03:04:12 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO HiveMetaStore: No user is added in admin role, since 
 config is empty
 15/08/05 03:04:13 INFO SessionState: No Tez session required at this point. 
 hive.execution.engine=mr.
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
 SQL context available as sqlContext.
 15/08/05 
