[jira] [Commented] (SPARK-14335) Describe function command returns wrong output because some of built-in functions are not in function registry.

2016-04-02 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222761#comment-15222761
 ] 

Yong Tang commented on SPARK-14335:
---

I can work on this one. Will provide a pull request shortly.

> Describe function command returns wrong output because some of built-in 
> functions are not in function registry.
> ---
>
> Key: SPARK-14335
> URL: https://issues.apache.org/jira/browse/SPARK-14335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> {code}
> %sql describe function `and`
> Function: and
> Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd
> Usage: a and b - Logical and
> {code}
> The output still shows Hive's function because {{and}} is not in our 
> FunctionRegistry. Here is a list of such commands:
> {code}
> -
> !
> !=
> *
> /
> &
> %
> ^
> +
> <
> <=
> <=>
> <>
> =
> ==
> >
> >=
> |
> ~
> and
> between
> case
> in
> like
> not
> or
> rlike
> when
> {code}
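
The likely direction for a fix is to register these operators and keywords as native expressions in Spark's own FunctionRegistry, so that DESCRIBE FUNCTION no longer falls through to Hive's registry. A hedged spark-shell sketch for observing the symptom (a Hive-enabled build is assumed, since the fallback only appears there):

{code}
// Illustrative check only: a registered built-in such as `upper` reports a
// Spark class, while the unregistered `and` currently falls back to Hive's
// GenericUDFOPAnd.
sqlContext.sql("DESCRIBE FUNCTION upper").collect().foreach(println)
sqlContext.sql("DESCRIBE FUNCTION `and`").collect().foreach(println)
{code}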






[jira] [Assigned] (SPARK-14335) Describe function command returns wrong output because some of built-in functions are not in function registry.

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14335:


Assignee: Apache Spark

> Describe function command returns wrong output because some of built-in 
> functions are not in function registry.
> ---
>
> Key: SPARK-14335
> URL: https://issues.apache.org/jira/browse/SPARK-14335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Minor
>
> {code}
> %sql describe function `and`
> Function: and
> Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd
> Usage: a and b - Logical and
> {code}
> The output still shows Hive's function because {{and}} is not in our 
> FunctionRegistry. Here is a list of such commands:
> {code}
> -
> !
> !=
> *
> /
> &
> %
> ^
> +
> <
> <=
> <=>
> <>
> =
> ==
> >
> >=
> |
> ~
> and
> between
> case
> in
> like
> not
> or
> rlike
> when
> {code}






[jira] [Commented] (SPARK-14335) Describe function command returns wrong output because some of built-in functions are not in function registry.

2016-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222767#comment-15222767
 ] 

Apache Spark commented on SPARK-14335:
--

User 'yongtang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12128

> Describe function command returns wrong output because some of built-in 
> functions are not in function registry.
> ---
>
> Key: SPARK-14335
> URL: https://issues.apache.org/jira/browse/SPARK-14335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> {code}
> %sql describe function `and`
> Function: and
> Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd
> Usage: a and b - Logical and
> {code}
> The output still shows Hive's function because {{and}} is not in our 
> FunctionRegistry. Here is a list of such commands:
> {code}
> -
> !
> !=
> *
> /
> &
> %
> ^
> +
> <
> <=
> <=>
> <>
> =
> ==
> >
> >=
> |
> ~
> and
> between
> case
> in
> like
> not
> or
> rlike
> when
> {code}






[jira] [Assigned] (SPARK-14335) Describe function command returns wrong output because some of built-in functions are not in function registry.

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14335:


Assignee: (was: Apache Spark)

> Describe function command returns wrong output because some of built-in 
> functions are not in function registry.
> ---
>
> Key: SPARK-14335
> URL: https://issues.apache.org/jira/browse/SPARK-14335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> {code}
> %sql describe function `and`
> Function: and
> Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPAnd
> Usage: a and b - Logical and
> {code}
> The output still shows Hive's function because {{and}} is not in our 
> FunctionRegistry. Here is a list of such commands:
> {code}
> -
> !
> !=
> *
> /
> &
> %
> ^
> +
> <
> <=
> <=>
> <>
> =
> ==
> >
> >=
> |
> ~
> and
> between
> case
> in
> like
> not
> or
> rlike
> when
> {code}






[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222792#comment-15222792
 ] 

Hyukjin Kwon commented on SPARK-14103:
--

[~shubhanshumis...@gmail.com] Right, it looks like an issue in Univocity Parser.

I could reproduce this error with the data below:

{code}
"a"bccc ddd
{code}

and code below:

{code}
val path = "temp.tsv"
sqlContext.read
.format("csv")
.option("maxCharsPerColumn", "4")
.option("delimiter", "\t")
.load(path)
{code}

It looks like the Univocity parser gets confused when it meets a {{quote}} 
character while parsing a value and the value does not end with that character. 
When this happens, it treats everything afterward, across rows and values, as a 
single quoted value.

So it looks like your data has such rows, for example:
{code}
7C0E15CD"I did it my way": moving away from the tyranny of turn-by-turn 
pedestrian navigation   i did it my way moving away from the tyranny of turn by 
turn pedestrian navigation  20102010/09/07  10.1145/1851600.1851660 
international conference on human computer interaction  interact
4333105818871
{code}

All the data after {{"I did it my way}} was being treated as a quoted value.


[~sowen] Actually, I have been a bit unsure whether Spark should use the 
Univocity parser. It is generally true that this library itself is faster than 
the Apache CSV parser, but it has added code complexity and there is some 
pretty messy additional logic needed to use Univocity for now. Also, it has 
become pretty difficult to figure out issues like this one.

I am thinking about switching from Univocity to the Apache CSV parser after 
performance tests. Do you think this makes sense?
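
For anyone who wants to confirm this outside Spark, here is a minimal standalone sketch against the univocity-parsers API that the CSV source wraps (the exact library version, and the tiny {{maxCharsPerColumn}} used to trigger the failure quickly, are assumptions mirroring the reproduction above):

{code}
// Hedged standalone reproduction of the unescaped-quote behaviour.
import java.io.StringReader
import com.univocity.parsers.common.TextParsingException
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

object QuoteConfusionRepro {
  def main(args: Array[String]): Unit = {
    val settings = new CsvParserSettings()
    settings.getFormat.setDelimiter('\t')   // tab-separated, like the report
    settings.setMaxCharsPerColumn(4)        // small limit so the runaway value fails fast
    val parser = new CsvParser(settings)
    // "a" is quoted but followed by more characters, so the parser keeps
    // accumulating the rest of the input into one "quoted" value until it
    // exceeds maxCharsPerColumn.
    parser.beginParsing(new StringReader("\"a\"bccc\tddd\nnext\tline"))
    try {
      var row = parser.parseNext()
      while (row != null) {
        println(row.mkString("|"))
        row = parser.parseNext()
      }
    } catch {
      case e: TextParsingException => println("Reproduced: " + e.getMessage)
    }
  }
}
{code}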

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following 
> command on a large tab-separated file, the contents of the file are written 
> to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)

[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222792#comment-15222792
 ] 

Hyukjin Kwon edited comment on SPARK-14103 at 4/2/16 8:34 AM:
--

[~shubhanshumis...@gmail.com] Right, it looks like an issue in Univocity Parser.

I could reproduce this error with the data below:

{code}
"a"bccc ddd
{code}

and code below:

{code}
val path = "temp.tsv"
sqlContext.read
  .format("csv")
  .option("maxCharsPerColumn", "4")
  .option("delimiter", "\t")
  .load(path)
{code}

It looks like the Univocity parser gets confused when it meets a {{quote}} 
character while parsing a value and the value does not end with that character. 
When this happens, it treats everything afterward, across rows and values, as a 
single quoted value.

So it looks like your data has such rows, for example:
{code}
7C0E15CD"I did it my way": moving away from the tyranny of turn-by-turn 
pedestrian navigation   i did it my way moving away from the tyranny of turn by 
turn pedestrian navigation  20102010/09/07  10.1145/1851600.1851660 
international conference on human computer interaction  interact
4333105818871
{code}

All the data after {{"I did it my way}} was being treated as a quoted value.


[~sowen] Actually, I have been a bit unsure whether Spark should use the 
Univocity parser. It is generally true that this library itself is faster than 
the Apache CSV parser, but it has added code complexity and there is some 
pretty messy additional logic needed to use Univocity for now. Also, it has 
become pretty difficult to figure out issues like this one.

I am thinking about switching from Univocity to the Apache CSV parser after 
performance tests. Do you think this makes sense?


was (Author: hyukjin.kwon):
[~shubhanshumis...@gmail.com] Right, it looks like an issue in Univocity Parser.

I could reproduce this error with the data below:

{code}
"a"bccc ddd
{code}

and code below:

{code}
val path = "temp.tsv"
sqlContext.read
.format("csv")
.option("maxCharsPerColumn", "4")
.option("delimiter", "\t")
.load(path)
{code}

It looks like the Univocity parser gets confused when it meets a {{quote}} 
character while parsing a value and the value does not end with that character. 
When this happens, it treats everything afterward, across rows and values, as a 
single quoted value.

So it looks like your data has such rows, for example:
{code}
7C0E15CD"I did it my way": moving away from the tyranny of turn-by-turn 
pedestrian navigation   i did it my way moving away from the tyranny of turn by 
turn pedestrian navigation  20102010/09/07  10.1145/1851600.1851660 
international conference on human computer interaction  interact
4333105818871
{code}

All the data after {{"I did it my way}} was being treated as a quoted value.


[~sowen] Actually, I have been a bit unsure whether Spark should use the 
Univocity parser. It is generally true that this library itself is faster than 
the Apache CSV parser, but it has added code complexity and there is some 
pretty messy additional logic needed to use Univocity for now. Also, it has 
become pretty difficult to figure out issues like this one.

I am thinking about switching from Univocity to the Apache CSV parser after 
performance tests. Do you think this makes sense?

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following 
> command on a large tab-separated file, the contents of the file are written 
> to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a 

[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222792#comment-15222792
 ] 

Hyukjin Kwon edited comment on SPARK-14103 at 4/2/16 8:35 AM:
--

[~shubhanshumis...@gmail.com] Right, it looks like an issue in Univocity Parser.

I could reproduce this error with the data below:

{code}
"a"bccc ddd
{code}

and code below:

{code}
val path = "temp.tsv"
sqlContext.read
  .format("csv")
  .option("maxCharsPerColumn", "4")
  .option("delimiter", "\t")
  .load(path)
{code}

It looks like the Univocity parser gets confused when it meets a {{quote}} 
character while parsing a value and the value does not end with that character. 
When this happens, it treats everything afterward, across rows and values, as a 
single quoted value.

So it looks like your data has such rows, for example:
{code}
7C0E15CD"I did it my way": moving away from the tyranny of turn-by-turn 
pedestrian navigation   i did it my way moving away from the tyranny of turn by 
turn pedestrian navigation  20102010/09/07  10.1145/1851600.1851660 
international conference on human computer interaction  interact
4333105818871
{code}

All the data after {{"I did it my way}} was being treated as a quoted value.


[~sowen] Actually, I have been a bit unsure about the use of the Univocity 
parser. It is generally true that this library itself is faster than the Apache 
CSV parser, but it has added code complexity and there is some pretty messy 
additional logic needed to use Univocity for now. Also, it has become pretty 
difficult to figure out issues like this one.

I am thinking about switching from Univocity to the Apache CSV parser after 
performance tests. Do you think this makes sense?


was (Author: hyukjin.kwon):
[~shubhanshumis...@gmail.com] Right, it looks like an issue in Univocity Parser.

I could reproduce this error with the data below:

{code}
"a"bccc ddd
{code}

and code below:

{code}
val path = "temp.tsv"
sqlContext.read
  .format("csv")
  .option("maxCharsPerColumn", "4")
  .option("delimiter", "\t")
  .load(path)
{code}

It looks like the Univocity parser gets confused when it meets a {{quote}} 
character while parsing a value and the value does not end with that character. 
When this happens, it treats everything afterward, across rows and values, as a 
single quoted value.

So it looks like your data has such rows, for example:
{code}
7C0E15CD"I did it my way": moving away from the tyranny of turn-by-turn 
pedestrian navigation   i did it my way moving away from the tyranny of turn by 
turn pedestrian navigation  20102010/09/07  10.1145/1851600.1851660 
international conference on human computer interaction  interact
4333105818871
{code}

All the data after {{"I did it my way}} was being treated as a quoted value.


[~sowen] Actually, I have been a bit unsure whether Spark should use the 
Univocity parser. It is generally true that this library itself is faster than 
the Apache CSV parser, but it has added code complexity and there is some 
pretty messy additional logic needed to use Univocity for now. Also, it has 
become pretty difficult to figure out issues like this one.

I am thinking about switching from Univocity to the Apache CSV parser after 
performance tests. Do you think this makes sense?

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following 
> command on a large tab-separated file, the contents of the file are written 
> to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic inter

[jira] [Created] (SPARK-14342) Remove straggler references to Tachyon

2016-04-02 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-14342:
-

 Summary: Remove straggler references to Tachyon
 Key: SPARK-14342
 URL: https://issues.apache.org/jira/browse/SPARK-14342
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core, Tests
Affects Versions: 2.0.0
Reporter: Liwei Lin
Priority: Minor









[jira] [Assigned] (SPARK-14342) Remove straggler references to Tachyon

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14342:


Assignee: Apache Spark

> Remove straggler references to Tachyon
> --
>
> Key: SPARK-14342
> URL: https://issues.apache.org/jira/browse/SPARK-14342
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-14342) Remove straggler references to Tachyon

2016-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222795#comment-15222795
 ] 

Apache Spark commented on SPARK-14342:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12129

> Remove straggler references to Tachyon
> --
>
> Key: SPARK-14342
> URL: https://issues.apache.org/jira/browse/SPARK-14342
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>







[jira] [Assigned] (SPARK-14342) Remove straggler references to Tachyon

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14342:


Assignee: (was: Apache Spark)

> Remove straggler references to Tachyon
> --
>
> Key: SPARK-14342
> URL: https://issues.apache.org/jira/browse/SPARK-14342
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>







[jira] [Commented] (SPARK-14207) Transformer for splitting a Vector/Array column into individual columns

2016-04-02 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222798#comment-15222798
 ] 

yuhao yang commented on SPARK-14207:


I'm having some trouble with transformSchema. Without the actual data, it's hard 
to tell from the schema alone how many output columns should be added.
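
One possible way around that, sketched below, is to lean on the ML attribute metadata when it is present: an {{AttributeGroup}} written by upstream stages (e.g. VectorAssembler) records the vector size and per-feature names, so transformSchema can derive the output columns from the schema alone; when the metadata is missing, the size would have to come from a user-supplied param or a first pass over the data. This is only a sketch of the idea, not a finished Transformer:

{code}
// Hedged sketch: derive output columns in transformSchema from vector metadata.
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

def splitVectorSchema(schema: StructType, inputCol: String): StructType = {
  val group = AttributeGroup.fromStructField(schema(inputCol))
  // group.size is -1 when no upstream stage recorded the vector length.
  require(group.size >= 0,
    s"Cannot determine the size of vector column '$inputCol' from the schema; " +
      "a user-specified size (or a pass over the data) would be needed.")
  val names = group.attributes match {
    case Some(attrs) =>
      attrs.zipWithIndex.map { case (a, i) => a.name.getOrElse(s"${inputCol}_$i") }
    case None =>
      (0 until group.size).map(i => s"${inputCol}_$i").toArray
  }
  StructType(schema.fields ++ names.map(n => StructField(n, DoubleType, nullable = false)))
}
{code}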

> Transformer for splitting a Vector/Array column into individual columns
> ---
>
> Key: SPARK-14207
> URL: https://issues.apache.org/jira/browse/SPARK-14207
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Use case: Given feature vector column of type {{Vector}} or 
> {{Array[Double]}}, split it into one column per feature.  Each column should 
> be labeled with the original feature name if that is available from the 
> metadata.
> What shall we call this Transformer?  Maybe VectorSplitter or VectorToColumns?






[jira] [Created] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results

2016-04-02 Thread Jurriaan Pruis (JIRA)
Jurriaan Pruis created SPARK-14343:
--

 Summary: Dataframe operations on a partitioned dataset (using 
partition discovery) return invalid results
 Key: SPARK-14343
 URL: https://issues.apache.org/jira/browse/SPARK-14343
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
 Environment: Mac OS X 10.11.4
Reporter: Jurriaan Pruis


When reading a dataset using {{sqlContext.read.text()}}, queries on the 
partitioned column return invalid results.

h2. How to reproduce:

h3. Generate datasets
{code:title=repro.sh}
#!/bin/sh

mkdir -p dataset/year=2014
mkdir -p dataset/year=2015

echo "data from 2014" > dataset/year=2014/part01.txt
echo "data from 2015" > dataset/year=2015/part01.txt
{code}

{code:title=repro2.sh}
#!/bin/sh

mkdir -p dataset2/month=june
mkdir -p dataset2/month=july

echo "data from june" > dataset2/month=june/part01.txt
echo "data from july" > dataset2/month=july/part01.txt
{code}

h3. using first dataset
{code:none}
>>> df = sqlContext.read.text('dataset')
...
>>> df
DataFrame[value: string, year: int]
>>> df.show()
+--++
| value|year|
+--++
|data from 2014|2014|
|data from 2015|2015|
+--++
>>> df.select('year').show()
++
|year|
++
|  14|
|  14|
++
{code}

This is clearly wrong. It seems like it returns the length of the value column.

h3. using second dataset

With another dataset it looks like this:
{code:none}
>>> df = sqlContext.read.text('dataset2')
>>> df
DataFrame[value: string, month: string]
>>> df.show()
+--+-+
| value|month|
+--+-+
|data from june| june|
|data from july| july|
+--+-+
>>> df.select('month').show()
+--+
| month|
+--+
|data from june|
|data from july|
+--+
{code}

Here it returns the value of the value column instead of the month partition.

h3. Workaround

If I convert the DataFrame to an RDD and back to a DataFrame, I get the 
following result (which is the expected behaviour):
{code:none}
>>> df.rdd.toDF().select('month').show()
+-+
|month|
+-+
| june|
| july|
+-+
{code}
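
For whoever picks this up, a small spark-shell sketch of the same check (assuming the {{dataset2}} directory produced by repro2.sh above and the 1.6.x API) makes it easy to capture the plans for the failing selection, which is probably the most useful extra detail to attach:

{code}
// Hedged diagnostic sketch: compare schema, plan, and output for the
// partition column read via partition discovery.
val df = sqlContext.read.text("dataset2")
df.printSchema()                    // value: string, month: string
df.select("month").explain(true)    // shows how the partition column is resolved
df.select("month").show()           // currently returns the 'value' contents
{code}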






[jira] [Updated] (SPARK-14342) Remove straggler references to Tachyon

2016-04-02 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-14342:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-11806

> Remove straggler references to Tachyon
> --
>
> Key: SPARK-14342
> URL: https://issues.apache.org/jira/browse/SPARK-14342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>







[jira] [Created] (SPARK-14344) saveAsParquetFile creates _metafile even when disabled

2016-04-02 Thread Kashish Jain (JIRA)
Kashish Jain created SPARK-14344:


 Summary: saveAsParquetFile creates _metafile even when disabled
 Key: SPARK-14344
 URL: https://issues.apache.org/jira/browse/SPARK-14344
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Kashish Jain
Priority: Minor
 Fix For: 1.2.3, 1.5.1, 1.3.1, 1.3.0, 1.2.2, 1.2.1


Setting the property {{spark.hadoop.parquet.enable.summary-metadata false}} in 
the Spark properties file does not prevent the creation of the _metadata file 
when calling rdd.saveAsParquetFile("TableName").
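
A possible workaround, assuming the property is simply not making it from the properties file into the Hadoop configuration used by the Parquet writer, would be to set the flag programmatically before saving (using the same {{sc}} and {{rdd}} as above; whether this actually suppresses _metadata on the affected versions is untested here):

{code}
// Hedged workaround sketch: set the Parquet summary-metadata flag directly on
// the SparkContext's Hadoop configuration instead of via spark.hadoop.* properties.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
rdd.saveAsParquetFile("TableName")
{code}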








[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222879#comment-15222879
 ] 

Hyukjin Kwon commented on SPARK-14103:
--

cc [~falaki] [~r...@databricks.com]

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following 
> command on a large tab-separated file, the contents of the file are written 
> to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; 
> aborting job
> ^M[Stage 1:>  (0 + 1) 
> / 2]
> {code}
> For a small sample (<10,00

[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222792#comment-15222792
 ] 

Hyukjin Kwon edited comment on SPARK-14103 at 4/2/16 1:51 PM:
--

[~shubhanshumis...@gmail.com] Right, it looks like an issue in Univocity Parser.

I could reproduce this error with the data below:

{code}
"a"bccc ddd
{code}

and code below:

{code}
val path = "temp.tsv"
sqlContext.read
  .format("csv")
  .option("maxCharsPerColumn", "4")
  .option("delimiter", "\t")
  .load(path)
{code}

It looks like the Univocity parser gets confused when it meets a {{quote}} 
character while parsing a value and the value does not end with that character. 
When this happens, it treats everything afterward, across rows and values, as a 
single quoted value.

So it looks like your data has such rows, for example:
{code}
7C0E15CD"I did it my way": moving away from the tyranny of turn-by-turn 
pedestrian navigation   i did it my way moving away from the tyranny of turn by 
turn pedestrian navigation  20102010/09/07  10.1145/1851600.1851660 
international conference on human computer interaction  interact
4333105818871
{code}

All the data after {{"I did it my way}} was being treated as a quoted value.


[~sowen] Actually, for me the use of the Univocity parser has been a bit 
questionable. It is generally true that this library itself is faster than the 
Apache CSV parser, but it has added code complexity and there is some pretty 
messy additional logic needed to use Univocity for now. Also, it has become 
pretty difficult to figure out issues like this one.

I am thinking about switching from Univocity to the Apache CSV parser after 
performance tests. Do you think this makes sense?


was (Author: hyukjin.kwon):
[~shubhanshumis...@gmail.com] Right, it looks like an issue in Univocity Parser.

I could reproduce this error with the data below:

{code}
"a"bccc ddd
{code}

and code below:

{code}
val path = "temp.tsv"
sqlContext.read
  .format("csv")
  .option("maxCharsPerColumn", "4")
  .option("delimiter", "\t")
  .load(path)
{code}

It looks like the Univocity parser gets confused when it meets a {{quote}} 
character while parsing a value and the value does not end with that character. 
When this happens, it treats everything afterward, across rows and values, as a 
single quoted value.

So it looks like your data has such rows, for example:
{code}
7C0E15CD"I did it my way": moving away from the tyranny of turn-by-turn 
pedestrian navigation   i did it my way moving away from the tyranny of turn by 
turn pedestrian navigation  20102010/09/07  10.1145/1851600.1851660 
international conference on human computer interaction  interact
4333105818871
{code}

All the data after {{"I did it my way}} was being treated as a quoted value.


[~sowen] Actually, I have been a bit unsure about the use of the Univocity 
parser. It is generally true that this library itself is faster than the Apache 
CSV parser, but it has added code complexity and there is some pretty messy 
additional logic needed to use Univocity for now. Also, it has become pretty 
difficult to figure out issues like this one.

I am thinking about switching from Univocity to the Apache CSV parser after 
performance tests. Do you think this makes sense?

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following 
> command on a large tab-separated file, the contents of the file are written 
> to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic in

[jira] [Created] (SPARK-14345) Decouple deserializer expression resolution from ObjectOperator

2016-04-02 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-14345:
---

 Summary: Decouple deserializer expression resolution from 
ObjectOperator
 Key: SPARK-14345
 URL: https://issues.apache.org/jira/browse/SPARK-14345
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222912#comment-15222912
 ] 

Sean Owen commented on SPARK-14103:
---

I've used the Apache parser in the past and it has been fine; I have never used 
this one. That's a funny bug. Any idea whether it's known or easily fixable? 
That would be the ideal way forward.

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following 
> command on a large tab-separated file, the contents of the file are written 
> to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 t

[jira] [Assigned] (SPARK-14345) Decouple deserializer expression resolution from ObjectOperator

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14345:


Assignee: Apache Spark  (was: Wenchen Fan)

> Decouple deserializer expression resolution from ObjectOperator
> ---
>
> Key: SPARK-14345
> URL: https://issues.apache.org/jira/browse/SPARK-14345
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-14345) Decouple deserializer expression resolution from ObjectOperator

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14345:


Assignee: Wenchen Fan  (was: Apache Spark)

> Decouple deserializer expression resolution from ObjectOperator
> ---
>
> Key: SPARK-14345
> URL: https://issues.apache.org/jira/browse/SPARK-14345
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-14345) Decouple deserializer expression resolution from ObjectOperator

2016-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222913#comment-15222913
 ] 

Apache Spark commented on SPARK-14345:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12131

> Decouple deserializer expression resolution from ObjectOperator
> ---
>
> Key: SPARK-14345
> URL: https://issues.apache.org/jira/browse/SPARK-14345
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Resolved] (SPARK-14178) DAGScheduler should get map output statuses directly, not by MapOutputTrackerMaster.getSerializedMapOutputStatuses.

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14178.
---
Resolution: Won't Fix

Resolving per comments in PR. It's not clear this is valid.

> DAGScheduler should get map output statuses directly, not by 
> MapOutputTrackerMaster.getSerializedMapOutputStatuses.
> ---
>
> Key: SPARK-14178
> URL: https://issues.apache.org/jira/browse/SPARK-14178
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> DAGScheduler gets map output statuses by 
> {{MapOutputTrackerMaster.getSerializedMapOutputStatuses}}.
> [DAGScheduler.scala#L357 | 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L357]
> {noformat}
>   private def newOrUsedShuffleStage(
>   shuffleDep: ShuffleDependency[_, _, _],
>   firstJobId: Int): ShuffleMapStage = {
> val rdd = shuffleDep.rdd
> val numTasks = rdd.partitions.length
> val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, 
> rdd.creationSite)
> if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
>   val serLocs = 
> mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
>   // Deserialization very time consuming. 
>  val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
>   (0 until locs.length).foreach { i =>
> if (locs(i) ne null) {
>   // locs(i) will be null if missing
>   stage.addOutputLoc(i, locs(i))
> }
>   }
> } else {
>   // Kind of ugly: need to register RDDs with the cache and map output 
> tracker here
>   // since we can't do it in the RDD constructor because # of partitions 
> is unknown
>   logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
>   mapOutputTracker.registerShuffle(shuffleDep.shuffleId, 
> rdd.partitions.length)
> }
> stage
>   }
> {noformat}






[jira] [Updated] (SPARK-12772) Better error message for syntax error in the SQL parser

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12772:
--
Assignee: Herman van Hovell

> Better error message for syntax error in the SQL parser
> ---
>
> Key: SPARK-12772
> URL: https://issues.apache.org/jira/browse/SPARK-12772
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Reynold Xin
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> {code}
> scala> sql("select case if(true, 'one', 'two')").explain(true)
> org.apache.spark.sql.AnalysisException: org.antlr.runtime.EarlyExitException
> line 1:34 required (...)+ loop did not match anything at input '' in 
> case expression
> ; line 1 pos 34
>   at 
> org.apache.spark.sql.catalyst.parser.ParseErrorReporter.throwError(ParseDriver.scala:140)
>   at 
> org.apache.spark.sql.catalyst.parser.ParseErrorReporter.throwError(ParseDriver.scala:129)
>   at 
> org.apache.spark.sql.catalyst.parser.ParseDriver$.parse(ParseDriver.scala:77)
>   at 
> org.apache.spark.sql.catalyst.CatalystQl.createPlan(CatalystQl.scala:53)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> {code}
> Is there a way to say something better other than "required (...)+ loop did 
> not match anything at input"?






[jira] [Updated] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11327:
--
Assignee: Jo Voordeckers

> spark-dispatcher doesn't pass along some spark properties
> -
>
> Key: SPARK-11327
> URL: https://issues.apache.org/jira/browse/SPARK-11327
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Alan Braithwaite
>Assignee: Jo Voordeckers
> Fix For: 2.0.0
>
>
> I haven't figured out exactly what's going on yet, but there's something in 
> the spark-dispatcher which is failing to pass along properties to the 
> spark-driver when using spark-submit in a clustered mesos docker environment.
> Most importantly, it's not passing along spark.mesos.executor.docker.image.
> cli:
> {code}
> docker run -t -i --rm --net=host 
> --entrypoint=/usr/local/spark/bin/spark-submit 
> docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf 
> spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master 
> mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster 
> --properties-file /usr/local/spark/conf/spark-defaults.conf --class 
> com.example.spark.streaming.MyApp 
> http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 
> spark-testing my-stream 40
> {code}
> submit output:
> {code}
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server 
> at http://compute1.example.com:31262/v1/submissions/create:
> {
>   "action" : "CreateSubmissionRequest",
>   "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ],
>   "appResource" : "http://jarserver.example.com:8000/sparkapp.jar";,
>   "clientSparkVersion" : "1.5.0",
>   "environmentVariables" : {
> "SPARK_SCALA_VERSION" : "2.10",
> "SPARK_CONF_DIR" : "/usr/local/spark/conf",
> "SPARK_HOME" : "/usr/local/spark",
> "SPARK_ENV_LOADED" : "1"
>   },
>   "mainClass" : "com.example.spark.streaming.MyApp",
>   "sparkProperties" : {
> "spark.serializer" : "org.apache.spark.serializer.KryoSerializer",
> "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : 
> "/usr/local/lib/libmesos.so",
> "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.eventLog.enabled" : "true",
> "spark.driver.maxResultSize" : "0",
> "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER",
> "spark.mesos.deploy.zookeeper.url" : 
> "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181",
> "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar";,
> "spark.driver.supervise" : "false",
> "spark.app.name" : "com.example.spark.streaming.MyApp",
> "spark.driver.memory" : "8G",
> "spark.logConf" : "true",
> "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher",
> "spark.mesos.executor.docker.image" : 
> "docker.example.com/spark-prod:2015.10.2",
> "spark.submit.deployMode" : "cluster",
> "spark.master" : "mesos://compute1.example.com:31262",
> "spark.executor.memory" : "8G",
> "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.mesos.docker.executor.network" : "HOST",
> "spark.mesos.executor.home" : "/usr/local/spark"
>   }
> }
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151026220353-0011. Polling submission state...
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151026220353-0011 in 
> mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server 
> at 
> http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "QUEUED",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver 
> driver-20151026220353-0011 is now QUEUED.
> 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> {code}
> driver log:
> {code}
> 15/10/26 22:08:08 INFO SparkContext: Running Sp

[jira] [Updated] (SPARK-12864) Fetch failure from AM restart

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12864:
--
Assignee: iward

> Fetch failure from AM restart
> -
>
> Key: SPARK-12864
> URL: https://issues.apache.org/jira/browse/SPARK-12864
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.1, 1.4.1, 1.5.2
>Reporter: iward
>Assignee: iward
> Fix For: 2.0.0
>
>
> Currently, when the number of executor failures reaches 
> *maxNumExecutorFailures*, the *ApplicationMaster* is killed and another one is 
> re-registered. At that point a new *YarnAllocator* instance is created.
> However, the *executorIdCounter* property in *YarnAllocator* resets to *0*, so 
> the *Id* of new executors starts from 1 again. These Ids collide with executors 
> created earlier, which causes a FetchFailedException.
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN 
> YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has 
> disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO 
> YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as 
> AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: 
> Registered executor: 
> AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@bjhc-hera-16217.hadoop.jd.local:46538/user/Executor#-790726793])
>  with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): 
> FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337
> ), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: 
> /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffl
> e_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> {noformat}
> As the task log shows, the executor ID of *BJHC-HERA-16217.hadoop.jd.local* 
> is the same as that of *BJHC-HERA-17030.hadoop.jd.local*. This confusion is 
> what causes the FetchFailedException.
> *This executorId conflict only occurs in yarn-client mode, because the driver 
> does not run on YARN.*
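
A minimal sketch of the idea behind a fix (illustrative only, not the actual 
YarnAllocator change): have a restarted AM seed its executor-ID counter from 
the largest ID the driver has already handed out, so new executors never reuse 
an existing ID. The {{fetchLatestExecutorIdFromDriver}} hook is an assumption 
made up for this sketch.
{code}
// Hypothetical sketch: resume executor IDs after an AM restart instead of
// restarting from 0. `fetchLatestExecutorIdFromDriver` is an assumed hook,
// not an existing Spark API.
class ExecutorIdAllocator(fetchLatestExecutorIdFromDriver: () => Int) {
  // Seed from the driver so a re-created allocator continues the sequence.
  private var executorIdCounter: Int = fetchLatestExecutorIdFromDriver()

  def newExecutorId(): String = {
    executorIdCounter += 1
    executorIdCounter.toString
  }
}
{code}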



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14308) Remove unused mllib tree classes and move private classes to ML

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14308:
--
Assignee: Seth Hendrickson

> Remove unused mllib tree classes and move private classes to ML
> ---
>
> Key: SPARK-14308
> URL: https://issues.apache.org/jira/browse/SPARK-14308
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.0.0
>
>
> After [SPARK-12183|https://issues.apache.org/jira/browse/SPARK-12183], some 
> mllib tree internal helper classes are no longer used at all. Also, the 
> private helper classes internal to spark tree training can be ported very 
> easily to spark.ML without affecting APIs. This is the "low hanging fruit" 
> for porting tree internals to spark.ML, and will make the other migrations 
> more tractable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13995) Extract correct IsNotNull constraints for Expression

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13995:
--
Assignee: Liang-Chi Hsieh

> Extract correct IsNotNull constraints for Expression
> 
>
> Key: SPARK-13995
> URL: https://issues.apache.org/jira/browse/SPARK-13995
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> We infer relative `IsNotNull` constraints from logical plan's expressions in 
> `constructIsNotNullConstraints` now. However, we don't consider the case of 
> (nested) `Cast`.
> For example:
> {code}
> val tr = LocalRelation('a.int, 'b.long)
> val plan = tr.where('a.attr === 'b.attr).analyze
> {code}
> Then, the plan's constraints will have `IsNotNull(Cast(resolveColumn(tr, 
> "a"), LongType))`, instead of `IsNotNull(resolveColumn(tr, "a"))`. This PR 
> fixes it.
> Besides, as `IsNotNull` constraints are most useful for `Attribute`, we 
> should recurse through any `Expression` that is null-intolerant and 
> construct `IsNotNull` constraints for all `Attribute`s under such 
> expressions.
> For example, consider the following constraints:
> {code}
> val df = Seq((1,2,3)).toDF("a", "b", "c")
> df.where("a + b = c").queryExecution.analyzed.constraints
> {code}
> The inferred isnotnull constraints should be isnotnull(a), isnotnull(b), 
> isnotnull(c), instead of isnotnull(a + c) and isnotnull(c).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14191) Fix Expand operator constraints

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14191:
--
Assignee: Liang-Chi Hsieh

> Fix Expand operator constraints
> ---
>
> Key: SPARK-14191
> URL: https://issues.apache.org/jira/browse/SPARK-14191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> The Expand operator currently uses its child plan's constraints as its valid 
> constraints (i.e., the base of its constraints). This is not correct, because 
> Expand sets its group-by attributes to null values, so the nullability of 
> these attributes should be true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14138) Generated SpecificColumnarIterator code can exceed JVM size limit for cached DataFrames

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14138:
--
Assignee: Kazuaki Ishizaki

> Generated SpecificColumnarIterator code can exceed JVM size limit for cached 
> DataFrames
> ---
>
> Key: SPARK-14138
> URL: https://issues.apache.org/jira/browse/SPARK-14138
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Sven Krasser
>Assignee: Kazuaki Ishizaki
> Fix For: 2.0.0
>
>
> The generated {{SpecificColumnarIterator}} code for wide DataFrames can 
> exceed the JVM 64k limit under certain circumstances. This snippet reproduces 
> the error in spark-shell (with 5G driver memory) by creating a new DataFrame 
> with >2000 aggregation-based columns:
> {code}
> val df = sc.parallelize(1 to 10).toDF()
> val aggr = {1 to 2260}.map(colnum => avg(df.col("_1")).as(s"col_$colnum"))
> val res = df.groupBy("_1").agg(count("_1"), aggr: _*).cache()
> res.show() // this will break
> {code}
> The following error is produced (pruned for brevity):
> {noformat}
> /* 001 */
> /* 002 */ import java.nio.ByteBuffer;
> /* 003 */ import java.nio.ByteOrder;
> /* 004 */ import scala.collection.Iterator;
> /* 005 */ import org.apache.spark.sql.types.DataType;
> /* 006 */ import 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder;
> /* 007 */ import 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;
> /* 008 */ import org.apache.spark.sql.execution.columnar.MutableUnsafeRow;
> /* 009 */
> /* 010 */ public SpecificColumnarIterator 
> generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
> /* 011 */   return new SpecificColumnarIterator();
> /* 012 */ }
> /* 013 */
> ...
> /* 9113 */ accessor2261.extractTo(mutableRow, 2261);
> /* 9114 */ unsafeRow.pointTo(bufferHolder.buffer, 2262, 
> bufferHolder.totalSize());
> /* 9115 */ return unsafeRow;
> /* 9116 */   }
> /* 9117 */ }
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:555)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:575)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:572)
>   at 
> org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   ... 28 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "()Z" 
> of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:836)
>   at org.codehaus.janino.UnitCompiler.writeOpcode(UnitCompiler.java:10251)
>   at org.codehaus.janino.UnitCompiler.invoke(UnitCompiler.java:10050)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4008)
>   at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3927)
>   at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
>   at 
> org.codehaus.janino.UnitCompiler.invokeConstructor(UnitCompiler.java:6681)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4126)
>   at org.codehaus.janino.UnitCompiler.access$7600(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitNewClassInstance(UnitCompiler.java:3275)
>   at org.codehaus.janino.Java$NewClassInstance.accept(Java.java:4085)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2669)
>   at o

[jira] [Updated] (SPARK-14344) saveAsParquetFile creates _metafile even when disabled

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14344:
--
Target Version/s:   (was: 1.2.1)
   Fix Version/s: (was: 1.5.1)
  (was: 1.2.3)
  (was: 1.3.1)
  (was: 1.2.2)
  (was: 1.2.1)
  (was: 1.3.0)

[~kasjain] read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. 
Don't set target/fixed version. 1.2 is quite old now. Test vs master please?

> saveAsParquetFile creates _metafile even when disabled
> --
>
> Key: SPARK-14344
> URL: https://issues.apache.org/jira/browse/SPARK-14344
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Kashish Jain
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Specifying the property "spark.hadoop.parquet.enable.summary-metadata false" 
> in the Spark properties file does not prevent the creation of the _metadata 
> file when calling rdd.saveAsParquetFile("TableName").
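
A hedged workaround sketch (not a confirmed fix for this report): set the 
Parquet key directly on the SparkContext's Hadoop configuration instead of 
relying on the {{spark.hadoop.*}} property translation. Written against the 
Spark 1.3+ DataFrame API; the table name and data are illustrative.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("no-parquet-summary").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Set the Parquet summary-metadata key on the Hadoop configuration used by the writer.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

Seq((1, "a"), (2, "b")).toDF("id", "name").saveAsParquetFile("TableName")
{code}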



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14265) When stage is resubmitted, DAG visualization does not render correctly for this stage

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14265:
--
Fix Version/s: (was: 2.0.0)

> When stage is resubmitted,  DAG visualization does not render correctly for 
> this stage
> --
>
> Key: SPARK-14265
> URL: https://issues.apache.org/jira/browse/SPARK-14265
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: KaiXinXIaoLei
> Attachments: dagIsBlank.png
>
>
> I run queries using "bin/spark-sql --master yarn". A stage fails and is 
> resubmitted. When I then check the DAG visualization in the web UI, it is blank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14237) De-duplicate partition value appending logic in various buildReader() implementations

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14237:
--
Fix Version/s: (was: 2.0.0)

> De-duplicate partition value appending logic in various buildReader() 
> implementations
> -
>
> Key: SPARK-14237
> URL: https://issues.apache.org/jira/browse/SPARK-14237
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> Various data sources share approximately the same code for partition value 
> appending. Would be nice to make it a utility method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14231) JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14231:
--
Fix Version/s: (was: 2.0.0)

> JSON data source fails to infer floats as decimal when precision is bigger 
> than 38 or scale is bigger than precision.
> -
>
> Key: SPARK-14231
> URL: https://issues.apache.org/jira/browse/SPARK-14231
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the JSON data source supports the {{floatAsBigDecimal}} option, 
> which reads floats as {{DecimalType}}.
> I noticed there are several restrictions on Spark's {{DecimalType}}:
> 1. The precision cannot be bigger than 38.
> 2. The scale cannot be bigger than the precision. 
> However, with the option above, it reads a {{BigDecimal}} that does not 
> necessarily satisfy these conditions.
> This could be observed as below:
> {code}
> def simpleFloats: RDD[String] =
>   sqlContext.sparkContext.parallelize(
> """{"a": 0.01}""" ::
> """{"a": 0.02}""" :: Nil)
> val jsonDF = sqlContext.read
>   .option("floatAsBigDecimal", "true")
>   .json(simpleFloats)
> jsonDF.printSchema()
> {code}
> throws an exception below:
> {code}
> org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater 
> than precision (1).;
>   at org.apache.spark.sql.types.DecimalType.(DecimalType.scala:44)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:59)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:57)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2249)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:55)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> ...
> {code}
> Since the JSON data source falls back to {{StringType}} when it fails to 
> infer a type, such values might have to be inferred as {{StringType}}, or 
> maybe simply as {{DoubleType}}.
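
A hedged sketch (not the actual {{InferSchema}} code) of the fallback suggested 
above: keep {{DecimalType}} only when the inferred precision and scale satisfy 
Spark's limits (precision <= {{DecimalType.MAX_PRECISION}}, i.e. 38, and scale 
<= precision), otherwise fall back to {{DoubleType}} or {{StringType}}.
{code}
import org.apache.spark.sql.types._

// Illustrative only: choose a type for a parsed BigDecimal, falling back when
// it cannot be represented as a valid Spark DecimalType.
def inferDecimalOrFallback(bd: java.math.BigDecimal): DataType = {
  val precision = bd.precision()
  val scale = bd.scale()
  if (precision <= DecimalType.MAX_PRECISION && scale <= precision) {
    DecimalType(precision, scale)
  } else {
    DoubleType // or StringType, matching the usual fallback for unparseable values
  }
}
{code}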



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14180) Deadlock in CoarseGrainedExecutorBackend Shutdown

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14180:
--
 Priority: Critical  (was: Blocker)
Fix Version/s: (was: 2.0.0)

> Deadlock in CoarseGrainedExecutorBackend Shutdown
> -
>
> Key: SPARK-14180
> URL: https://issues.apache.org/jira/browse/SPARK-14180
> Project: Spark
>  Issue Type: Bug
> Environment: master branch.  commit 
> d6dc12ef0146ae409834c78737c116050961f350
>Reporter: Michael Gummelt
>Priority: Critical
>
> I'm fairly certain that https://github.com/apache/spark/pull/11031 introduced 
> a deadlock in executor shutdown.  The result is executor shutdown hangs 
> indefinitely.  In Mesos at least, this lasts until 
> {{spark.mesos.coarse.shutdownTimeout}} (default 10s), at which point the 
> driver stops, which force kills the executors.
> The deadlock is as follows:
> - CoarseGrainedExecutorBackend receives a Shutdown message, which now blocks 
> on rpcEnv.awaitTermination() 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkEnv.scala#L95
> - rpcEnv.awaitTermination() blocks on dispatcher.awaitTermination(), which 
> blocks until all dispatcher threads (MessageLoop threads) terminate
> - However, the initial Shutdown message handling is itself handled by a 
> Dispatcher MessageLoop thread.  This mutual dependence results in a deadlock. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala#L216
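
A self-contained illustration of the mutual wait described above, using plain 
JDK threading rather than Spark code: a task running on a single-threaded pool 
must not block on that pool's own termination, so the wait is handed off to a 
separate thread.
{code}
import java.util.concurrent.{Executors, TimeUnit}

object ShutdownDeadlockSketch {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newSingleThreadExecutor()
    pool.submit(new Runnable {
      override def run(): Unit = {
        pool.shutdown()
        // Calling pool.awaitTermination() right here would be the analogue of
        // awaitTermination() on the dispatcher from inside a MessageLoop
        // thread: the task would wait on itself forever. Waiting from a
        // separate thread breaks the cycle.
        new Thread(new Runnable {
          override def run(): Unit = pool.awaitTermination(10, TimeUnit.SECONDS)
        }).start()
      }
    })
  }
}
{code}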



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222931#comment-15222931
 ] 

Hyukjin Kwon commented on SPARK-14103:
--

After thinking further, I realised that this might be the right behaviour in a 
way. I just checked the [Univocity parser 
API|http://docs.univocity.com/parsers/1.5.0/com/univocity/parsers/csv/CsvFormat.html]
 and it mentions a case like this for the quote option. 

I think it is intended to work like this: the value in the case above starts 
with a {{quote}} character, which implies that it is a value running up to the 
next closing {{quote}} character.

Maybe those quotes have to be escaped, or {{quote}} should be set to another 
character or {{null}} (not sure whether {{null}} works, though).

I haven't checked how Apache Commons CSV handles this. Let me test this soon 
and will update if there is anything else worth noting.
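
For reference, a hedged sketch of that workaround in the Scala API (the Python 
equivalent is analogous); whether an empty or {{null}} quote character behaves 
as intended is exactly what needs verifying. Assumes an existing 
{{sqlContext}}; the path is illustrative.
{code}
// Sketch only: read a tab-separated file with quoting effectively disabled by
// overriding the default quote character.
val df = sqlContext.read
  .format("csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .option("quote", "") // or another character that never appears in the data
  .load("temp.txt")
{code}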

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following 
> command on a large tab-separated file, the contents of the file are 
> written to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.

[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222931#comment-15222931
 ] 

Hyukjin Kwon edited comment on SPARK-14103 at 4/2/16 4:22 PM:
--

After thinking further, I realised that this might be a right behaviour in a 
way. I just checked the [Univocity parser 
API|http://docs.univocity.com/parsers/1.5.0/com/univocity/parsers/csv/CsvFormat.html]
 and this mentions a case similar with this at quote option, although for me it 
still looks a bit weird because  I think it is sensible to parse {{"a"b}} to 
{{ab}}.

I think they intended to work like this as the value in the above case is 
anyway started with a {{quote}} character and this might imply that it is a 
value up to another ending {{quote}} character.

Maybe those quotes might have to be followed by escape characters or set 
{{quote}} to another character or {{null}} (not sure if {{null}} works though).

I haven't checked how Apache CSV works with this. Let me test this soon and 
will update if there is something else I should inform.


was (Author: hyukjin.kwon):
After thinking further, I realised that this might be a right behaviour in a 
way. I just checked the [Univocity parser 
API|http://docs.univocity.com/parsers/1.5.0/com/univocity/parsers/csv/CsvFormat.html]
 and this mentions a case about this at quote option. 

I think they intended to work like this as the value in the above case is 
anyway started with a {{quote}} character and this might imply that it is a 
value up to another ending {{quote}} character.

Maybe those quotes might have to be followed by escape characters or set 
{{quote}} to another character or {{null}} (not sure if {{null}} works though).

I haven't checked how Apache CSV works with this. Let me test this soon and 
will update if there is something else I should inform.

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using the spark from the master branch and when I run the following 
> command on a large tab separated file then I get the contents of the file 
> being written to the stderr
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterat

[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222931#comment-15222931
 ] 

Hyukjin Kwon edited comment on SPARK-14103 at 4/2/16 4:23 PM:
--

After thinking further, I realised that this might be a right behaviour in a 
way. I just checked the [Univocity parser 
API|http://docs.univocity.com/parsers/1.5.0/com/univocity/parsers/csv/CsvFormat.html]
 and this mentions a case similar with this at quote option, although for me it 
still looks a bit weird because  I think it is sensible to parse {{"a"b}} to 
{{ab}}.

I think they intended to work like this as the value in the above case is 
anyway started with a {{quote}} character and this might imply that it is a 
value up to another ending {{quote}} character right before a delimiter.

Maybe those quotes might have to be followed by escape characters or set 
{{quote}} to another character or {{null}} (not sure if {{null}} works though).

I haven't checked how Apache CSV works with this. Let me test this soon and 
will update if there is something else I should inform.


was (Author: hyukjin.kwon):
After thinking further, I realised that this might be a right behaviour in a 
way. I just checked the [Univocity parser 
API|http://docs.univocity.com/parsers/1.5.0/com/univocity/parsers/csv/CsvFormat.html]
 and this mentions a case similar with this at quote option, although for me it 
still looks a bit weird because  I think it is sensible to parse {{"a"b}} to 
{{ab}}.

I think they intended to work like this as the value in the above case is 
anyway started with a {{quote}} character and this might imply that it is a 
value up to another ending {{quote}} character.

Maybe those quotes might have to be followed by escape characters or set 
{{quote}} to another character or {{null}} (not sure if {{null}} works though).

I haven't checked how Apache CSV works with this. Let me test this soon and 
will update if there is something else I should inform.

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using the spark from the master branch and when I run the following 
> command on a large tab separated file then I get the contents of the file 
> being written to the stderr
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> 

[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222938#comment-15222938
 ] 

Sean Owen commented on SPARK-14103:
---

I don't think this case is ambiguous. The second " appears alone without 
preceding \ or following ". However I don't know if it's valid to quote only 
part of a field in CSV. And it doesn't seem to match the intent; the content 
should escape those quotes. I think you could argue it's a bad input problem 
but the error is odd.

> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using the spark from the master branch and when I run the following 
> command on a large tab separated file then I get the contents of the file 
> being written to the stderr
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Th

[jira] [Updated] (SPARK-14130) [Table related commands] Alter column

2016-04-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14130:
-
Description: 
For alter column command, we have the following tokens.
TOK_ALTERTABLE_RENAMECOL
TOK_ALTERTABLE_ADDCOLS
TOK_ALTERTABLE_REPLACECOLS

For data source tables, we should throw exceptions. For Hive tables, we should 
support them. *For Hive tables, we should check Hive's behavior because for a 
file format, it may not support all of the above commands*. 


  was:
For alter column command, we have the following tokens.
TOK_ALTERTABLE_RENAMECOL
TOK_ALTERTABLE_ADDCOLS
TOK_ALTERTABLE_REPLACECOLS

For data source tables, we should throw exceptions. For Hive tables, we should 
support them.



> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> For alter column command, we have the following tokens.
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior 
> because for a file format, it may not support all of the above commands*. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14130) [Table related commands] Alter column

2016-04-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14130:
-
Description: 
For alter column command, we have the following tokens.
TOK_ALTERTABLE_RENAMECOL
TOK_ALTERTABLE_ADDCOLS
TOK_ALTERTABLE_REPLACECOLS

For data source tables, we should throw exceptions. For Hive tables, we should 
support them. *For Hive tables, we should check Hive's behavior to see if there 
is any file format that does not support any of the above commands*. 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 is a good reference for Hive's behavior. 


  was:
For alter column command, we have the following tokens.
TOK_ALTERTABLE_RENAMECOL
TOK_ALTERTABLE_ADDCOLS
TOK_ALTERTABLE_REPLACECOLS

For data source tables, we should throw exceptions. For Hive tables, we should 
support them. *For Hive tables, we should check Hive's behavior because for a 
file format, it may not support all of above commands*. 



> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> For alter column command, we have the following tokens.
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
>  is a good reference for Hive's behavior. 
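
For context, a sketch of the SQL forms these tokens correspond to (HiveQL 
syntax; which of them Spark should execute natively for a given file format is 
the open question above). Assumes an existing {{sqlContext}} backed by Hive and 
an existing Hive table {{t}}.
{code}
sqlContext.sql("ALTER TABLE t CHANGE COLUMN a a1 INT")             // TOK_ALTERTABLE_RENAMECOL
sqlContext.sql("ALTER TABLE t ADD COLUMNS (c STRING)")             // TOK_ALTERTABLE_ADDCOLS
sqlContext.sql("ALTER TABLE t REPLACE COLUMNS (d INT, e STRING)")  // TOK_ALTERTABLE_REPLACECOLS
{code}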



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Shubhanshu Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223006#comment-15223006
 ] 

Shubhanshu Mishra commented on SPARK-14103:
---

[~hyukjin.kwon] thanks for pointing this out. I used `quote=""` as the value 
and the DataFrame reader was able to parse the file correctly. 

{code}
df = sqlContext.read.load("temp.txt", format="csv", header="false", quote="", 
inferSchema="true", delimiter="\t") # WORKS
{code}

After your comment, I looked at the 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 file, which sets the default quote character to `"`. However, in the `getChar` 
function, it is mentioned that if the length of the option is 0, the value is 
set to the null unicode char `\u0000`. 

I think this fixes the issue. However, the long error message should still be 
taken care of. 
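
A hedged paraphrase (illustrative, not the exact CSVOptions code) of the 
behaviour described above: an empty string for {{quote}} maps to the null 
character, which effectively disables quoting in the underlying parser.
{code}
// Sketch of the described option handling: empty string -> '\u0000',
// single character -> that character, anything longer -> error.
def getChar(paramValue: Option[String], default: Char): Char = paramValue match {
  case None                     => default
  case Some(v) if v.isEmpty     => '\u0000'
  case Some(v) if v.length == 1 => v.charAt(0)
  case Some(v) =>
    throw new RuntimeException(s"$v cannot be more than one character")
}
{code}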



> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using the spark from the master branch and when I run the following 
> command on a large tab separated file then I get the contents of the file 
> being written to the stderr
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at 
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apach

[jira] [Comment Edited] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

2016-04-02 Thread Shubhanshu Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223006#comment-15223006
 ] 

Shubhanshu Mishra edited comment on SPARK-14103 at 4/2/16 6:51 PM:
---

[~hyukjin.kwon] thanks for pointing this out. I used the {code}quote=""{code} 
as a value and the dataframe reader was able to correctly parse the file. 

{code}
df = sqlContext.read.load("temp.txt", format="csv", header="false", quote="", 
inferSchema="true", delimiter="\t") # WORKS
{code}

After your comment, I looked at the 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 file which sets the default quote character to {code}"{code}, however, in the 
{code}getChar{code} function, it is mentioned if the length of the option is 0 
then the value will be set to the null unicode char {code}\u000{code}. 

I think this fixes up this issue. However, the long error message should be 
taken care of. 




was (Author: shubhanshumis...@gmail.com):
[~hyukjin.kwon] thanks for pointing this out. I used the `quote=""` as a value 
and the dataframe reader was able to correctly parse the file. 

{code}
df = sqlContext.read.load("temp.txt", format="csv", header="false", quote="", 
inferSchema="true", delimiter="\t") # WORKS
{code}

After your comment, I looked at the 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 file which sets the default quote character to `"`, however, in the `getChar` 
function, it is mentioned if the length of the option is 0 then the value will 
be set to the null unicode char `\u000`. 

I think this fixes up this issue. However, the long error message should be 
taken care of. 



> Python DataFrame CSV load on large file is writing to console in Ipython
> 
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master 
> branch
>Reporter: Shubhanshu Mishra
>  Labels: csv, csvparser, dataframe, pyspark
>
> I am using the spark from the master branch and when I run the following 
> command on a large tab separated file then I get the contents of the file 
> being written to the stderr
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", 
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:>  (0 + 2) 
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: 
> Length of parsed input (101) exceeds the maximum number of characters 
> defined in your parser settings (100). Identified line separator 
> characters in the parsed content. This may be the cause of the error. The 
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in 
> mobile location sharing applications   privacy shake a haptic interface 
> for managing privacy settings in mobile location sharing applications  2010   
>  2010/09/07  international conference on human computer 
> interaction  interact4333105819371[\n]
> 3D4F6CA1Between the Profiles: Another such Bias. Technology 
> Acceptance Studies on Social Network Services   between the profiles 
> another such bias technology acceptance studies on social network services 
> 20152015/08/02  10.1007/978-3-319-21383-5_12international 
> conference on human-computer interaction  interact43331058
> 19502[\n]
> ...
> .
> web snippets20082008/05/04  10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist  44F2980219489
> 06FA3FFAInteractive 3D User Interfaces for Neuroanatomy Exploration   
>   interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at 
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at 
>

[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2016-04-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223010#comment-15223010
 ] 

Josh Rosen commented on SPARK-14083:


Here's one example of how we might aim to preserve Java/Scala closure API null 
behavior for field accesses:

Consider the following closure:

{code}
val ds = Seq[(String, Integer)](("a", 1), ("b", 2), ("c", 3), (null, 
null)).toDF()
ds.filter(r => r.getInt(1) == 2).collect()
{code}

This code will fail with a NullPointerException in the getInt() call (per its 
contract). This closure's bytecode looks like this:

{code}
aload_1
iconst_1
invokeinterface #22 = Method org.apache.spark.sql.Row.getInt((I)I)
iconst_2
if_icmpne 15
iconst_1
goto 16
iconst_0
ireturn
{code}

My most recent prototype converts this into

{code}
cast(if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as boolean)
{code}

where {{npeonnull}} is a new non-SQL expression which throws a null pointer 
exception on null inputs. If we trust our nullability analysis optimization 
rules, then we could add a trivial optimizer rule to eliminate {{npeonnull}} 
calls when their children are non-nullable.

If a user wanted to implement the SQL filter semantics here, then they could 
rewrite their closure to

{code}
  ds.filter(r => !r.isNullAt(1) && r.getInt(1) == 2)
{code}

My prototype translates this closure into

{code}
cast(if (isnull(_2#3)) 0 else if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as 
boolean)
{code}

Again, I think that this could be easily simplified given some new optimizer 
rules:

- We can propagate the negation of the `if` condition into the attributes of 
the else branch.
- Therefore, we can conclude that column 2 is not null when analyzing the else 
case and can strip out the `npeonnull` check.
- After both optimizations plus cast pushdown, constant folding, and an 
optimization for rewriting {{if}} expressions with non-nullable conditions by 
the condition expression itself, I think we could produce exactly the same 
{{filter _2#3 = 2}} expression that the Catalyst expression DSL would have 
given us.
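
For illustration, a self-contained toy sketch (plain Scala, deliberately not 
using real Catalyst classes) of the "trivial optimizer rule" mentioned above: 
strip the hypothetical {{npeonnull}} wrapper whenever its child is known to be 
non-nullable.
{code}
// Toy stand-ins for expressions; only nullability matters for this sketch.
sealed trait Expr { def nullable: Boolean }
case class Attr(name: String, nullable: Boolean) extends Expr
case class NpeOnNull(child: Expr) extends Expr { val nullable = false }
case class EqualTo(left: Expr, right: Expr) extends Expr { val nullable = false }
case class Lit(value: Int) extends Expr { val nullable = false }

// Remove NpeOnNull wrappers whose child is provably non-nullable.
def eliminateNpeOnNull(e: Expr): Expr = e match {
  case NpeOnNull(child) if !child.nullable => eliminateNpeOnNull(child)
  case NpeOnNull(child)                    => NpeOnNull(eliminateNpeOnNull(child))
  case EqualTo(l, r)                       => EqualTo(eliminateNpeOnNull(l), eliminateNpeOnNull(r))
  case other                               => other
}

// Example: EqualTo(NpeOnNull(Attr("_2", nullable = false)), Lit(2))
// simplifies to EqualTo(Attr("_2", false), Lit(2)), i.e. `filter _2 = 2`.
{code}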

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
> lit(18)
> df.map(_.id + 1)  // equivalent to Add(col("age"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note that a big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2016-04-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223010#comment-15223010
 ] 

Josh Rosen edited comment on SPARK-14083 at 4/2/16 7:02 PM:


Here's one example of how we might aim to preserve Java/Scala closure API null 
behavior for field accesses:

Consider the following closure:

{code}
val ds = Seq[(String, Integer)](("a", 1), ("b", 2), ("c", 3), (null, 
null)).toDF()
ds.filter(r => r.getInt(1) == 2).collect()
{code}

This code will fail with a NullPointerException in the getInt() call (per its 
contract). This closure's bytecode looks like this:

{code}
aload_1
iconst_1
invokeinterface #22 = Method org.apache.spark.sql.Row.getInt((I)I)
iconst_2
if_icmpne 15
iconst_1
goto 16
iconst_0
ireturn
{code}

My most recent prototype converts this into

{code}
cast(if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as boolean)
{code}

where {{npeonnull}} is a new non-SQL expression which throws a null pointer 
exception on null inputs. If we trust our nullability analysis optimization 
rules, then we could add a trivial optimizer rule to eliminate {{npeonnull}} 
calls when their children are non-nullable.

If a user wanted to implement the SQL filter semantics here, then they could 
rewrite their closure to

{code}
  ds.filter(r => !r.isNullAt(1) && r.getInt(1) == 2)
{code}

My prototype translates this closure into

{code}
cast(if (isnull(_2#3)) 0 else if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as 
boolean)
{code}

Again, I think that this could be easily simplified given some new optimizer 
rules:

- We can propagate the negation of the `if` condition into the attributes of 
the else branch.
- Therefore, we can conclude that column 2 is not null when analyzing the else 
case and can strip out the `npeonnull` check.
- After both optimizations plus cast pushdown, constant folding, and an 
optimization for rewriting {{if(condition, trueLiteral, falseLiteral)}} 
expressions with non-nullable conditions by the condition expression itself, I 
think we could produce exactly the same {{filter _2#3 = 2}} expression that the 
Catalyst expression DSL would have given us.


was (Author: joshrosen):
Here's one example of how we might aim to preserve Java/Scala closure API null 
behavior for field accesses:

Consider the following closure:

{code}
val ds = Seq[(String, Integer)](("a", 1), ("b", 2), ("c", 3), (null, 
null)).toDF()
ds.filter(r => r.getInt(1) == 2).collect()
{code}

This code will fail with a NullPointerException in the getInt() call (per its 
contract). This closure's bytecode looks like this:

{code}
aload_1
iconst_1
invokeinterface #22 = Method org.apache.spark.sql.Row.getInt((I)I)
iconst_2
if_icmpne 15
iconst_1
goto 16
iconst_0
ireturn
{code}

My most recent prototype converts this into

{code}
cast(if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as boolean)
{code}

where {{npeonnull}} is a new non-SQL expression which throws a null pointer 
exception on null inputs. If we trust our nullability analysis optimization 
rules, then we could add a trivial optimizer rule to eliminate {{npeonnull}} 
calls when their children are non-nullable.

If a user wanted to implement the SQL filter semantics here, then they could 
rewrite their closure to

{code}
  ds.filter(r => !r.isNullAt(1) && r.getInt(1) == 2)
{code}

My prototype translates this closure into

{code}
cast(if (isnull(_2#3)) 0 else if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as 
boolean)
{code}

Again, I think that this could be easily simplified given some new optimizer 
rules:

- We can propagate the negation of the `if` condition into the attributes of 
the else branch.
- Therefore, we can conclude that column 2 is not null when analyzing the else 
case and can strip out the `npeonnull` check.
- After both optimizations plus cast pushdown, constant folding, and an 
optimization for rewriting {{if}} expressions with non-nullable conditions by 
the condition expression itself, I think we could produce exactly the same 
{{filter _2#3 = 2}} expression that the Catalyst expression DSL would have 
given us.

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out 

[jira] [Created] (SPARK-14346) SHOW CREATE TABLE command (Native)

2016-04-02 Thread Xin Wu (JIRA)
Xin Wu created SPARK-14346:
--

 Summary: SHOW CREATE TABLE command (Native)
 Key: SPARK-14346
 URL: https://issues.apache.org/jira/browse/SPARK-14346
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu


This command will return a CREATE TABLE command in SQL. Right now, we just 
throw an exception (I was not sure how often people would use it). Since it is a 
pretty standalone piece of work (generating a CREATE TABLE command based on the 
metadata of a table) and people may find it pretty useful, I am thinking of 
getting it into 2.0. Hive's implementation can be found at 
https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L1898-L2126.
 The main difference for Spark is that if we have a data source table, we 
should use Spark's syntax (CREATE TABLE ... USING ... OPTIONS) instead of 
Hive's syntax.
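
To make the difference concrete, the intended behavior is roughly the following 
(spark-shell style; the table names and the echoed DDL are made-up illustrations, 
not output from an existing implementation):

{code}
// A data source table should round-trip through Spark's own syntax:
sql("SHOW CREATE TABLE json_events").show(truncate = false)
// e.g. CREATE TABLE json_events USING json OPTIONS (path '/data/events')

// while a Hive table should keep Hive's DDL:
sql("SHOW CREATE TABLE hive_events").show(truncate = false)
// e.g. CREATE TABLE hive_events (id INT, name STRING) STORED AS TEXTFILE
{code}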



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14346) SHOW CREATE TABLE command (Native)

2016-04-02 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-14346:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-14118

> SHOW CREATE TABLE command (Native)
> --
>
> Key: SPARK-14346
> URL: https://issues.apache.org/jira/browse/SPARK-14346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> This command will return a CREATE TABLE command in SQL. Right now, we just 
> throw exception (I was not sure how often people will use it). Since it is a 
> pretty standalone work (generating a CREATE TABLE command based on the 
> metadata of a table) and people may find it pretty useful, I am thinking to 
> get it in 2.0. Hive's implementation can be found at 
> https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L1898-L2126.
>  The main difference for spark is that if we have a data source table, we 
> should use Spark's syntax (CREATE TABLE ... USING ... OPTIONS) instead of 
> Hive's syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14346) SHOW CREATE TABLE command (Native)

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14346:


Assignee: Apache Spark

> SHOW CREATE TABLE command (Native)
> --
>
> Key: SPARK-14346
> URL: https://issues.apache.org/jira/browse/SPARK-14346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>Assignee: Apache Spark
>
> This command will return a CREATE TABLE command in SQL. Right now, we just 
> throw exception (I was not sure how often people will use it). Since it is a 
> pretty standalone work (generating a CREATE TABLE command based on the 
> metadata of a table) and people may find it pretty useful, I am thinking to 
> get it in 2.0. Hive's implementation can be found at 
> https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L1898-L2126.
>  The main difference for spark is that if we have a data source table, we 
> should use Spark's syntax (CREATE TABLE ... USING ... OPTIONS) instead of 
> Hive's syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14346) SHOW CREATE TABLE command (Native)

2016-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223028#comment-15223028
 ] 

Apache Spark commented on SPARK-14346:
--

User 'xwu0226' has created a pull request for this issue:
https://github.com/apache/spark/pull/12132

> SHOW CREATE TABLE command (Native)
> --
>
> Key: SPARK-14346
> URL: https://issues.apache.org/jira/browse/SPARK-14346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> This command will return a CREATE TABLE command in SQL. Right now, we just 
> throw exception (I was not sure how often people will use it). Since it is a 
> pretty standalone work (generating a CREATE TABLE command based on the 
> metadata of a table) and people may find it pretty useful, I am thinking to 
> get it in 2.0. Hive's implementation can be found at 
> https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L1898-L2126.
>  The main difference for spark is that if we have a data source table, we 
> should use Spark's syntax (CREATE TABLE ... USING ... OPTIONS) instead of 
> Hive's syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14346) SHOW CREATE TABLE command (Native)

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14346:


Assignee: (was: Apache Spark)

> SHOW CREATE TABLE command (Native)
> --
>
> Key: SPARK-14346
> URL: https://issues.apache.org/jira/browse/SPARK-14346
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> This command will return a CREATE TABLE command in SQL. Right now, we just 
> throw exception (I was not sure how often people will use it). Since it is a 
> pretty standalone work (generating a CREATE TABLE command based on the 
> metadata of a table) and people may find it pretty useful, I am thinking to 
> get it in 2.0. Hive's implementation can be found at 
> https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L1898-L2126.
>  The main difference for spark is that if we have a data source table, we 
> should use Spark's syntax (CREATE TABLE ... USING ... OPTIONS) instead of 
> Hive's syntax.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14347) Require Java 8 for Spark 2.x

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14347:
--
Issue Type: Sub-task  (was: Task)
Parent: SPARK-11806

> Require Java 8 for Spark 2.x
> 
>
> Key: SPARK-14347
> URL: https://issues.apache.org/jira/browse/SPARK-14347
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Spark Core, SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> Putting this down as a JIRA to advance the discussion -- I think this is far 
> enough along to consensus for that.
> The change here is to require Java 8. This means:
> - Require Java 8 in the build
> - Only build and test with Java 8, removing other older Jenkins configs
> - Remove MaxPermSize
> - Remove reflection to use Java 8-only methods
> - Move external/java8-tests to core/streaming and remove profile
> And optionally:
> - Update all Java 8 code to take advantage of 8+ features, like lambdas, for 
> simplification



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14347) Require Java 8 for Spark 2.x

2016-04-02 Thread Sean Owen (JIRA)
Sean Owen created SPARK-14347:
-

 Summary: Require Java 8 for Spark 2.x
 Key: SPARK-14347
 URL: https://issues.apache.org/jira/browse/SPARK-14347
 Project: Spark
  Issue Type: Task
  Components: MLlib, Spark Core, SQL, Streaming
Affects Versions: 2.0.0
Reporter: Sean Owen


Putting this down as a JIRA to advance the discussion -- I think it is far 
enough along toward consensus for that.

The change here is to require Java 8. This means:

- Require Java 8 in the build
- Only build and test with Java 8, removing other older Jenkins configs
- Remove MaxPermSize
- Remove reflection to use Java 8-only methods
- Move external/java8-tests to core/streaming and remove profile

And optionally:

- Update all Java 8 code to take advantage of 8+ features, like lambdas, for 
simplification



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14348) Support native execution of SHOW DATABASE command

2016-04-02 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-14348:


 Summary: Support native execution of SHOW DATABASE command
 Key: SPARK-14348
 URL: https://issues.apache.org/jira/browse/SPARK-14348
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Dilip Biswal


1. Support parsing of SHOW TBLPROPERTIES command
2. Support the native execution of SHOW TBLPROPERTIES command

The syntax for these SHOW commands is described at the following link:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables
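
For reference, the commands covered here look roughly like this (spark-shell 
style; the table name and property key are made up):

{code}
// All properties of a table:
sql("SHOW TBLPROPERTIES events").show(truncate = false)

// A single property, following the Hive syntax linked above:
sql("""SHOW TBLPROPERTIES events("created.by.user")""").show(truncate = false)
{code}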



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14348) Support native execution of SHOW TBLPROPERTIES command

2016-04-02 Thread Dilip Biswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-14348:
-
Summary: Support native execution of SHOW TBLPROPERTIES command  (was: 
Support native execution of SHOW DATABASE command)

> Support native execution of SHOW TBLPROPERTIES command
> --
>
> Key: SPARK-14348
> URL: https://issues.apache.org/jira/browse/SPARK-14348
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dilip Biswal
>
> 1. Support parsing of SHOW TBLPROPERTIES command
> 2. Support the native execution of SHOW TBLPROPERTIES command
> The syntax for SHOW commands are described in following link:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14348) Support native execution of SHOW TBLPROPERTIES command

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14348:


Assignee: (was: Apache Spark)

> Support native execution of SHOW TBLPROPERTIES command
> --
>
> Key: SPARK-14348
> URL: https://issues.apache.org/jira/browse/SPARK-14348
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dilip Biswal
>
> 1. Support parsing of SHOW TBLPROPERTIES command
> 2. Support the native execution of SHOW TBLPROPERTIES command
> The syntax for SHOW commands are described in following link:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14348) Support native execution of SHOW TBLPROPERTIES command

2016-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223034#comment-15223034
 ] 

Apache Spark commented on SPARK-14348:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12133

> Support native execution of SHOW TBLPROPERTIES command
> --
>
> Key: SPARK-14348
> URL: https://issues.apache.org/jira/browse/SPARK-14348
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dilip Biswal
>
> 1. Support parsing of SHOW TBLPROPERTIES command
> 2. Support the native execution of SHOW TBLPROPERTIES command
> The syntax for SHOW commands are described in following link:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14348) Support native execution of SHOW TBLPROPERTIES command

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14348:


Assignee: Apache Spark

> Support native execution of SHOW TBLPROPERTIES command
> --
>
> Key: SPARK-14348
> URL: https://issues.apache.org/jira/browse/SPARK-14348
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dilip Biswal
>Assignee: Apache Spark
>
> 1. Support parsing of SHOW TBLPROPERTIES command
> 2. Support the native execution of SHOW TBLPROPERTIES command
> The syntax for SHOW commands are described in following link:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-04-02 Thread Matt Butler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223080#comment-15223080
 ] 

Matt Butler commented on SPARK-12675:
-

We are hitting this as well, with the exact same stack trace as Anthony Brew 
above: Spark 1.6.0, local mode. I can reproduce it at will in my codebase, but 
can't share it. I will try to narrow it down further. We use a very small number 
of partitions (on the order of 2 or 4).

I don't mind the exception so much; it's the fact that the workers are deemed 
dead because they are no longer sending heartbeats.

> Executor dies because of ClassCastException and causes timeout
> --
>
> Key: SPARK-12675
> URL: https://issues.apache.org/jira/browse/SPARK-12675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
> Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz
>Reporter: Alexandru Rosianu
>Priority: Minor
>
> I'm trying to fit a Spark ML pipeline but my executor dies. Here's the script 
> which doesn't work (a bit simplified):
> {code:title=Script.scala}
> // Prepare data sets
> logInfo("Getting datasets")
> val emoTrainingData = 
> sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet")
> val trainingData = emoTrainingData
> // Configure the pipeline
> val pipeline = new Pipeline().setStages(Array(
>   new 
> FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"),
>   new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"),
>   new Tokenizer().setInputCol("text").setOutputCol("raw_words"),
>   new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"),
>   new HashingTF().setInputCol("words").setOutputCol("features"),
>   new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"),
>   new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", 
> "raw_words", "words", "features")
> ))
> // Fit the pipeline
> logInfo(s"Training model on ${trainingData.count()} rows")
> val model = pipeline.fit(trainingData)
> {code}
> It executes up to the last line. It prints "Training model on xx rows", then 
> it starts fitting, the executor dies, the drivers doesn't receive heartbeats 
> from the executor and it times out, then the script exits. It doesn't get 
> past that line.
> This is the exception that kills the executor:
> {code}
> java.io.IOException: java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.HashMap$SerializationProxy to field 
> org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type 
> scala.collection.immutable.Map in instance of 
> org.apache.spark.executor.TaskMetrics
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207)
>   at 
> org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at org.apache.spark.util.Utils$.deserialize(Utils.scala:92)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>   at 
> org.apache.spark.executor.Executor

[jira] [Created] (SPARK-14349) Issue Error Messages for Unsupported Operations in SQL Context.

2016-04-02 Thread Xiao Li (JIRA)
Xiao Li created SPARK-14349:
---

 Summary: Issue Error Messages for Unsupported Operations in SQL 
Context.
 Key: SPARK-14349
 URL: https://issues.apache.org/jira/browse/SPARK-14349
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Currently, weird error messages are issued if we use Hive Context-only 
operations in SQL Context. 

For example, 

1. When calling `Drop Table` in SQL Context, we got the following message:
{{{
Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be 
thrown, but java.lang.ClassCastException was thrown.
}}}
2. When calling `Script Transform` in SQL Context, we got the message:
{{{
assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, 
[tKey#155,tValue#156], null
+- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at 
BeforeAndAfterAll.scala:187
}}}
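
A minimal way to reproduce both cases against a plain SQL Context (the table 
name and script are only for illustration):

{code}
// 1. Hive-only DDL:
sqlContext.sql("DROP TABLE events")

// 2. Hive-only script transform:
sqlContext.sql("SELECT TRANSFORM (key, value) USING 'cat' AS (tKey, tValue) FROM events")
{code}

In both cases the expectation is a clear "operation not supported in SQL Context" 
style error rather than a ClassCastException or an analyzer assertion failure.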



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14349) Issue Error Messages for Unsupported Operations in SQL Context.

2016-04-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14349:

Description: 
Currently, weird error messages are issued if we use Hive Context-only 
operations in SQL Context. 

For example, 

1. When calling `Drop Table` in SQL Context, we got the following message:

Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be 
thrown, but java.lang.ClassCastException was thrown.

2. When calling `Script Transform` in SQL Context, we got the message:

assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, 
[tKey#155,tValue#156], null
+- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at 
BeforeAndAfterAll.scala:187


  was:
Currently, the weird error messages are issued if we use Hive Context-only 
operations in SQL Context. 

For example, 

1. When calling `Drop Table` in SQL Context, we got the following message:
{{{
Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be 
thrown, but java.lang.ClassCastException was thrown.
}}}
2. When calling `Script Transform` in SQL Context, we got the message:
{{{
assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, 
[tKey#155,tValue#156], null
+- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at 
BeforeAndAfterAll.scala:187
}}}


> Issue Error Messages for Unsupported Operations in SQL Context.
> ---
>
> Key: SPARK-14349
> URL: https://issues.apache.org/jira/browse/SPARK-14349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, the weird error messages are issued if we use Hive Context-only 
> operations in SQL Context. 
> For example, 
> 1. When calling `Drop Table` in SQL Context, we got the following message:
> Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be 
> thrown, but java.lang.ClassCastException was thrown.
> 2. When calling `Script Transform` in SQL Context, we got the message:
> assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, 
> [tKey#155,tValue#156], null
> +- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at 
> BeforeAndAfterAll.scala:187



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14349) Issue Error Messages for Unsupported Operations in SQL Context.

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14349:


Assignee: (was: Apache Spark)

> Issue Error Messages for Unsupported Operations in SQL Context.
> ---
>
> Key: SPARK-14349
> URL: https://issues.apache.org/jira/browse/SPARK-14349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, the weird error messages are issued if we use Hive Context-only 
> operations in SQL Context. 
> For example, 
> 1. When calling `Drop Table` in SQL Context, we got the following message:
> Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be 
> thrown, but java.lang.ClassCastException was thrown.
> 2. When calling `Script Transform` in SQL Context, we got the message:
> assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, 
> [tKey#155,tValue#156], null
> +- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at 
> BeforeAndAfterAll.scala:187



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14349) Issue Error Messages for Unsupported Operations in SQL Context.

2016-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14349:


Assignee: Apache Spark

> Issue Error Messages for Unsupported Operations in SQL Context.
> ---
>
> Key: SPARK-14349
> URL: https://issues.apache.org/jira/browse/SPARK-14349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Currently, the weird error messages are issued if we use Hive Context-only 
> operations in SQL Context. 
> For example, 
> 1. When calling `Drop Table` in SQL Context, we got the following message:
> Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be 
> thrown, but java.lang.ClassCastException was thrown.
> 2. When calling `Script Transform` in SQL Context, we got the message:
> assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, 
> [tKey#155,tValue#156], null
> +- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at 
> BeforeAndAfterAll.scala:187



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14349) Issue Error Messages for Unsupported Operations in SQL Context.

2016-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223081#comment-15223081
 ] 

Apache Spark commented on SPARK-14349:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12134

> Issue Error Messages for Unsupported Operations in SQL Context.
> ---
>
> Key: SPARK-14349
> URL: https://issues.apache.org/jira/browse/SPARK-14349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, the weird error messages are issued if we use Hive Context-only 
> operations in SQL Context. 
> For example, 
> 1. When calling `Drop Table` in SQL Context, we got the following message:
> Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be 
> thrown, but java.lang.ClassCastException was thrown.
> 2. When calling `Script Transform` in SQL Context, we got the message:
> assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, 
> [tKey#155,tValue#156], null
> +- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at 
> BeforeAndAfterAll.scala:187



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-04-02 Thread Matt Butler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223082#comment-15223082
 ] 

Matt Butler commented on SPARK-12675:
-

Tried to recreate; now seeing a new problem in the heartbeater:

2016-04-02 20:06:52,004 ERROR [driver-heartbeater] org.apache.spark.util.Utils 
[Logging.scala:95] - Uncaught exception in thread driver-heartbeater
java.io.IOException: java.lang.ClassNotFoundException: 
org.apache.spark.storage.RDDBlockId
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at sun.reflect.GeneratedMethodAccessor139.invoke(Unknown Source) 
~[na:na]
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[na:1.8.0_25]
at java.lang.reflect.Method.invoke(Method.java:483) ~[na:1.8.0_25]
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) 
~[na:1.8.0_25]
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896) 
~[na:1.8.0_25]
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) 
~[na:1.8.0_25]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) 
~[na:1.8.0_25]
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) 
~[na:1.8.0_25]
at org.apache.spark.util.Utils$.deserialize(Utils.scala:92) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at scala.Option.foreach(Option.scala:236) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468) 
[com.privacyanalytics.spark-1.6.0.jar:na]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[na:1.8.0_25]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) 
[na:1.8.0_25]
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 [na:1.8.0_25]
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 [na:1.8.0_25]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_25]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_25]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_25]

> Executor dies because of ClassCastException and causes timeout
> --
>
> Key: SPARK-12675
> URL: https://issues.apache.org/jira/browse/SPARK-12675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
> Environment: 64-bit Linux Ubuntu 15.10, 1

[jira] [Issue Comment Deleted] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-04-02 Thread Matt Butler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Butler updated SPARK-12675:

Comment: was deleted

(was: Tried to recreate, now new problem in heartbeater

2016-04-02 20:06:52,004 ERROR [driver-heartbeater] org.apache.spark.util.Utils 
[Logging.scala:95] - Uncaught exception in thread driver-heartbeater
java.io.IOException: java.lang.ClassNotFoundException: 
org.apache.spark.storage.RDDBlockId
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at sun.reflect.GeneratedMethodAccessor139.invoke(Unknown Source) 
~[na:na]
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[na:1.8.0_25]
at java.lang.reflect.Method.invoke(Method.java:483) ~[na:1.8.0_25]
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) 
~[na:1.8.0_25]
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896) 
~[na:1.8.0_25]
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) 
~[na:1.8.0_25]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) 
~[na:1.8.0_25]
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) 
~[na:1.8.0_25]
at org.apache.spark.util.Utils$.deserialize(Utils.scala:92) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at scala.Option.foreach(Option.scala:236) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468) 
[com.privacyanalytics.spark-1.6.0.jar:na]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[na:1.8.0_25]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) 
[na:1.8.0_25]
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 [na:1.8.0_25]
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 [na:1.8.0_25]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_25]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_25]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_25])

> Executor dies because of ClassCastException and causes timeout
> --
>
> Key: SPARK-12675
> URL: https://issues.apache.org/jira/browse/SPARK-12675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
> Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz
>  

[jira] [Comment Edited] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-04-02 Thread Matt Butler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223080#comment-15223080
 ] 

Matt Butler edited comment on SPARK-12675 at 4/3/16 12:31 AM:
--

We are hitting this as well, with the exact same stack trace as Anthony Brew 
above: Spark 1.6.0, local mode. I can reproduce it at will in my codebase, but 
can't share it. I will try to narrow it down further. We use a very small number 
of partitions (on the order of 2 or 4).

I don't mind the exception so much; it's the fact that the workers are deemed 
dead because they are no longer sending heartbeats.

Tried to recreate and now see a different stack trace:

2016-04-02 20:06:52,004 ERROR [driver-heartbeater] org.apache.spark.util.Utils 
[Logging.scala:95] - Uncaught exception in thread driver-heartbeater
java.io.IOException: java.lang.ClassNotFoundException: 
org.apache.spark.storage.RDDBlockId
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at sun.reflect.GeneratedMethodAccessor139.invoke(Unknown Source) 
~[na:na]
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[na:1.8.0_25]
at java.lang.reflect.Method.invoke(Method.java:483) ~[na:1.8.0_25]
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) 
~[na:1.8.0_25]
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896) 
~[na:1.8.0_25]
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) 
~[na:1.8.0_25]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) 
~[na:1.8.0_25]
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) 
~[na:1.8.0_25]
at org.apache.spark.util.Utils$.deserialize(Utils.scala:92) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at scala.Option.foreach(Option.scala:236) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468) 
[com.privacyanalytics.spark-1.6.0.jar:na]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[na:1.8.0_25]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) 
[na:1.8.0_25]



Line 437 of Executor makes this very strange. We were able to serialize the 
metrics, but can no longer deserialize them?
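
For what it's worth, a ClassNotFoundException while deserializing something that 
was serialized in the same process usually points at the ObjectInputStream 
resolving classes against a different classloader than the one that loaded them. 
A generic sketch of pinning the loader (not Spark's actual heartbeat code, just 
the general technique):

{code}
import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream, ObjectStreamClass}

// An ObjectInputStream that resolves classes against an explicit classloader
// (for example the thread context loader) instead of the default one.
class LoaderAwareObjectInputStream(in: InputStream, loader: ClassLoader)
    extends ObjectInputStream(in) {
  override def resolveClass(desc: ObjectStreamClass): Class[_] =
    try Class.forName(desc.getName, false, loader)
    catch { case _: ClassNotFoundException => super.resolveClass(desc) }
}

def deserialize[T](bytes: Array[Byte], loader: ClassLoader): T = {
  val in = new LoaderAwareObjectInputStream(new ByteArrayInputStream(bytes), loader)
  try in.readObject().asInstanceOf[T] finally in.close()
}
{code}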


was (Author: matt.s.butler):
We are hitting this as well. Exact same stack trace as Anthony Brew above. 
Spark 1.6.0, local mode. I can reproduce it at will in my codebase, but can't 
share. Will try and narrow it down further. Very small number of partitions (on 
the order of 2 or 4)

I don't mind the exception so much it's the fact that the workers are deemed 
dead as they are no longer sending heartbeats

[jira] [Comment Edited] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-04-02 Thread Matt Butler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223080#comment-15223080
 ] 

Matt Butler edited comment on SPARK-12675 at 4/3/16 12:32 AM:
--

We are hitting this as well, with the exact same stack trace as Anthony Brew 
above: Spark 1.6.0, local mode. I can reproduce it at will in my codebase, but 
can't share it. I will try to narrow it down further. We use a very small number 
of partitions (on the order of 2 or 4).

I don't mind the exception so much; it's the fact that the workers are deemed 
dead because they are no longer sending heartbeats.

Tried to recreate and now see a different stack trace (below).

Line 437 of Executor makes this very strange. We were able to serialize the 
metrics, but can no longer deserialize them?


2016-04-02 20:06:52,004 ERROR [driver-heartbeater] org.apache.spark.util.Utils 
[Logging.scala:95] - Uncaught exception in thread driver-heartbeater
java.io.IOException: java.lang.ClassNotFoundException: 
org.apache.spark.storage.RDDBlockId
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at sun.reflect.GeneratedMethodAccessor139.invoke(Unknown Source) 
~[na:na]
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[na:1.8.0_25]
at java.lang.reflect.Method.invoke(Method.java:483) ~[na:1.8.0_25]
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) 
~[na:1.8.0_25]
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896) 
~[na:1.8.0_25]
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) 
~[na:1.8.0_25]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) 
~[na:1.8.0_25]
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) 
~[na:1.8.0_25]
at org.apache.spark.util.Utils$.deserialize(Utils.scala:92) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at scala.Option.foreach(Option.scala:236) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468) 
[com.privacyanalytics.spark-1.6.0.jar:na]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[na:1.8.0_25]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) 
[na:1.8.0_25]






was (Author: matt.s.butler):
We are hitting this as well. Exact same stack trace as Anthony Brew above. 
Spark 1.6.0, local mode. I can reproduce it at will in my codebase, but can't 
share. Will try and narrow it down further. Very small number of partitions (on 
the order of 2 or 4)

I don't mind the exception so much it's the fact that the workers are deemed 
dead as they are no longer sending

[jira] [Comment Edited] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-04-02 Thread Matt Butler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223080#comment-15223080
 ] 

Matt Butler edited comment on SPARK-12675 at 4/3/16 12:32 AM:
--

We are hitting this as well, with the exact same stack trace as Anthony Brew 
above: Spark 1.6.0, local mode. I can reproduce it at will in my codebase, but 
can't share it. I will try to narrow it down further. We use a very small number 
of partitions (on the order of 2 or 4).

I don't mind the exception so much; it's the fact that the workers are deemed 
dead because they are no longer sending heartbeats.

Tried to recreate and now see a different stack trace (below).

Line 437 of Executor makes this very strange. We were able to serialize the 
metrics, but can no longer deserialize them?





2016-04-02 20:06:52,004 ERROR [driver-heartbeater] org.apache.spark.util.Utils 
[Logging.scala:95] - Uncaught exception in thread driver-heartbeater
java.io.IOException: java.lang.ClassNotFoundException: 
org.apache.spark.storage.RDDBlockId
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at sun.reflect.GeneratedMethodAccessor139.invoke(Unknown Source) 
~[na:na]
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[na:1.8.0_25]
at java.lang.reflect.Method.invoke(Method.java:483) ~[na:1.8.0_25]
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) 
~[na:1.8.0_25]
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896) 
~[na:1.8.0_25]
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) 
~[na:1.8.0_25]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) 
~[na:1.8.0_25]
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) 
~[na:1.8.0_25]
at org.apache.spark.util.Utils$.deserialize(Utils.scala:92) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at scala.Option.foreach(Option.scala:236) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
~[org.scala-lang.scala-library-2.10.6.jar:na]
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
 ~[com.privacyanalytics.spark-1.6.0.jar:na]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741) 
~[com.privacyanalytics.spark-1.6.0.jar:na]
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468) 
[com.privacyanalytics.spark-1.6.0.jar:na]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[na:1.8.0_25]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) 
[na:1.8.0_25]






was (Author: matt.s.butler):
We are hitting this as well. Exact same stack trace as Anthony Brew above. 
Spark 1.6.0, local mode. I can reproduce it at will in my codebase, but can't 
share. Will try and narrow it down further. Very small number of partitions (on 
the order of 2 or 4)

I don't mind the exception so much it's the fact that the workers are deemed 
dead as they are no longer send

[jira] [Resolved] (SPARK-14338) Improve `SimplifyConditionals` rule to handle `null` in IF/CASEWHEN

2016-04-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14338.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Improve `SimplifyConditionals` rule to handle `null` in IF/CASEWHEN
> ---
>
> Key: SPARK-14338
> URL: https://issues.apache.org/jira/browse/SPARK-14338
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> Currently, `SimplifyConditionals` handles `true` and `false` to optimize 
> branches. This issue improves `SimplifyConditionals` to take advantage of 
> `null` conditions for `if` and `CaseWhen` expressions, too.
> *Before*
> {code}
> scala> sql("SELECT IF(null, 1, 0)").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [if (null) 1 else 0 AS (IF(CAST(NULL AS BOOLEAN), 1, 0))#4]
> : +- INPUT
> +- Scan OneRowRelation[]
> scala> sql("select case when cast(null as boolean) then 1 else 2 
> end").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [CASE WHEN null THEN 1 ELSE 2 END AS CASE WHEN CAST(NULL AS 
> BOOLEAN) THEN 1 ELSE 2 END#14]
> : +- INPUT
> +- Scan OneRowRelation[]
> {code}
> *After*
> {code}
> scala> sql("SELECT IF(null, 1, 0)").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [0 AS (IF(CAST(NULL AS BOOLEAN), 1, 0))#4]
> : +- INPUT
> +- Scan OneRowRelation[]
> scala> sql("select case when cast(null as boolean) then 1 else 2 
> end").explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [2 AS CASE WHEN CAST(NULL AS BOOLEAN) THEN 1 ELSE 2 END#4]
> : +- INPUT
> +- Scan OneRowRelation[]
> {code}
> *Hive*
> {code}
> hive> select if(null,1,2);
> OK
> 2
> hive> select case when cast(null as boolean) then 1 else 2 end;
> OK
> 2
> {code}
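
The shape of the change is roughly the following (a sketch against Catalyst's 
expression API, not the exact patch that was merged):

{code}
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, Expression, If, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: a literal-null predicate can never evaluate to true, so it behaves
// like a literal false for branch selection.
object SimplifyNullConditionals extends Rule[LogicalPlan] {
  private def isNullLiteral(e: Expression): Boolean = e match {
    case Literal(null, _) => true
    case _ => false
  }

  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // IF(null, a, b) always evaluates to b.
    case If(cond, _, falseValue) if isNullLiteral(cond) => falseValue
    // CASE WHEN: drop branches whose condition is a literal null.
    // (The sketch ignores the edge case where every branch is dropped.)
    case CaseWhen(branches, elseValue) if branches.exists(b => isNullLiteral(b._1)) =>
      CaseWhen(branches.filterNot(b => isNullLiteral(b._1)), elseValue)
  }
}
{code}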



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-04-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-13877.
---
  Resolution: Won't Fix
Target Version/s:   (was: 2.0.0)

Closing as won't fix since we are keeping Kafka in Spark.


> Consider removing Kafka modules from Spark / Spark Streaming
> 
>
> Key: SPARK-13877
> URL: https://issues.apache.org/jira/browse/SPARK-13877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1
>Reporter: Hari Shreedharan
>
> Based on the discussion the PR for SPARK-13843 
> ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), 
> we should consider moving the Kafka modules out of Spark as well. 
> Providing newer functionality (like security) has become painful while 
> maintaining compatibility with older versions of Kafka. Moving this out 
> allows more flexibility, allowing users to mix and match Kafka and Spark 
> versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14342) Remove straggler references to Tachyon

2016-04-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14342.
-
   Resolution: Fixed
 Assignee: Liwei Lin
Fix Version/s: 2.0.0

> Remove straggler references to Tachyon
> --
>
> Key: SPARK-14342
> URL: https://issues.apache.org/jira/browse/SPARK-14342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Liwei Lin
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14350) explain output should be in a single cell rather than one line per cell

2016-04-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223087#comment-15223087
 ] 

Reynold Xin commented on SPARK-14350:
-

cc [~dongjoon] Do you want to try fixing this?


> explain output should be in a single cell rather than one line per cell
> ---
>
> Key: SPARK-14350
> URL: https://issues.apache.org/jira/browse/SPARK-14350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> See 
> {code}
> scala> sql("explain select 1").head
> res3: org.apache.spark.sql.Row = [== Physical Plan ==]
> {code}
> We should show the entire output, rather than just the first line, when 
> {{head}} is used. That is, the result should contain a single row rather than 
> one row per line of the plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14350) explain output should be in a single cell rather than one line per cell

2016-04-02 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14350:
---

 Summary: explain output should be in a single cell rather than one 
line per cell
 Key: SPARK-14350
 URL: https://issues.apache.org/jira/browse/SPARK-14350
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin


See 

{code}
scala> sql("explain select 1").head
res3: org.apache.spark.sql.Row = [== Physical Plan ==]
{code}

We should show the entire output, rather than just the first line, when {{head}} 
is used. That is, the result should contain a single row rather than one row per 
line of the plan.
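
As a rough illustration of the current behavior (a workaround sketch, not the proposed fix): because each plan line comes back as its own Row today, recovering the full plan text means collecting every row.

{code}
// Illustrative workaround under the current behavior: reassemble the plan
// from the one-line-per-row result.
val planText = sql("explain select 1").collect().map(_.getString(0)).mkString("\n")
println(planText)
{code}

With the proposed change, the same query would instead return a single row containing the whole plan.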






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14056:
--
Priority: Minor  (was: Major)

> Add s3 configurations and spark.hadoop.* configurations to hive configuration
> -
>
> Key: SPARK-14056
> URL: https://issues.apache.org/jira/browse/SPARK-14056
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>Assignee: Sital Kedia
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, when creating a HiveConf in TableReader.scala, we are not passing 
> S3-specific configurations (like AWS S3 credentials) or the spark.hadoop.* 
> configurations set by the user. We should fix this issue. 
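
A minimal sketch of the idea (the helper name is illustrative, not the exact TableReader.scala change): copy the user-set spark.hadoop.* keys into the Hadoop/Hive configuration, stripping the prefix.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Sketch: propagate spark.hadoop.* settings (and anything else the user set
// with that prefix, e.g. S3 credentials) into the Hadoop/Hive configuration.
def appendSparkHadoopConfigs(sparkConf: SparkConf, hadoopConf: Configuration): Unit = {
  sparkConf.getAll.foreach {
    case (key, value) if key.startsWith("spark.hadoop.") =>
      hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
    case _ => // leave non-Hadoop settings alone
  }
}
{code}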



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14056:
--
Assignee: Sital Kedia

> Add s3 configurations and spark.hadoop.* configurations to hive configuration
> -
>
> Key: SPARK-14056
> URL: https://issues.apache.org/jira/browse/SPARK-14056
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>Assignee: Sital Kedia
> Fix For: 2.0.0
>
>
> Currently, when creating a HiveConf in TableReader.scala, we are not passing 
> S3-specific configurations (like AWS S3 credentials) or the spark.hadoop.* 
> configurations set by the user. We should fix this issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14056) Add s3 configurations and spark.hadoop.* configurations to hive configuration

2016-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14056.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11876
[https://github.com/apache/spark/pull/11876]

> Add s3 configurations and spark.hadoop.* configurations to hive configuration
> -
>
> Key: SPARK-14056
> URL: https://issues.apache.org/jira/browse/SPARK-14056
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
> Fix For: 2.0.0
>
>
> Currently, when creating a HiveConf in TableReader.scala, we are not passing 
> S3-specific configurations (like AWS S3 credentials) or the spark.hadoop.* 
> configurations set by the user. We should fix this issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13996) Add more not null attributes for Filter codegen

2016-04-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13996:
---
Fix Version/s: (was: 2.1.0)
   2.0.0

> Add more not null attributes for Filter codegen
> ---
>
> Key: SPARK-13996
> URL: https://issues.apache.org/jira/browse/SPARK-13996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> Filter codegen determines which attributes are not null by looking for an 
> IsNotNull(a) expression, guarded by the condition {{if child.output.contains(a)}}. 
> However, the current check is not comprehensive; we can improve it.
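
A hedged sketch of the extraction logic the description refers to (simplified and illustrative, not the actual Spark code):

{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, IsNotNull}

// Sketch: collect the attributes the predicate asserts to be non-null,
// restricted to attributes produced by the child plan.
def notNullAttributes(condition: Expression, output: Seq[Attribute]): Seq[Attribute] =
  condition.collect {
    case IsNotNull(a: Attribute) if output.contains(a) => a
  }
{code}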



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13996) Add more not null attributes for Filter codegen

2016-04-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13996.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 11810
[https://github.com/apache/spark/pull/11810]

> Add more not null attributes for Filter codegen
> ---
>
> Key: SPARK-13996
> URL: https://issues.apache.org/jira/browse/SPARK-13996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> Filter codegen determines which attributes are not null by looking for an 
> IsNotNull(a) expression, guarded by the condition {{if child.output.contains(a)}}. 
> However, the current check is not comprehensive; we can improve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13996) Add more not null attributes for Filter codegen

2016-04-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13996:
---
Assignee: Liang-Chi Hsieh

> Add more not null attributes for Filter codegen
> ---
>
> Key: SPARK-13996
> URL: https://issues.apache.org/jira/browse/SPARK-13996
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> Filter codegen determines which attributes are not null by looking for an 
> IsNotNull(a) expression, guarded by the condition {{if child.output.contains(a)}}. 
> However, the current check is not comprehensive; we can improve it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14130) [Table related commands] Alter column

2016-04-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14130:
-
Description: 
For alter column command, we have the following tokens.
TOK_ALTERTABLE_RENAMECOL
TOK_ALTERTABLE_ADDCOLS
TOK_ALTERTABLE_REPLACECOLS

For data source tables, we should throw exceptions. For Hive tables, we should 
support them. *For Hive tables, we should check Hive's behavior to see if there 
is any file format that does not support any of the above commands*. 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 is a good reference for Hive's behavior. 

Also, for a Hive table stored in a given format, we need to make sure that 
Spark can still read the table after an alter column operation. If we cannot 
read the table, we should still throw an exception even if Hive allows the 
alter column operation. For example, if renaming a column of a Hive parquet 
table makes the renamed column inaccessible (we cannot read its values), we 
should not allow the renaming operation.


  was:
For alter column command, we have the following tokens.
TOK_ALTERTABLE_RENAMECOL
TOK_ALTERTABLE_ADDCOLS
TOK_ALTERTABLE_REPLACECOLS

For data source tables, we should throw exceptions. For Hive tables, we should 
support them. *For Hive tables, we should check Hive's behavior to see if there 
is any file format that does not support any of the above commands*. 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
 is a good reference for Hive's behavior. 



> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> For alter column command, we have the following tokens.
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
>  is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot 
> read the table, we should still throw an exception even if Hive allows the 
> alter column operation. For example, if renaming a column of a Hive parquet 
> table makes the renamed column inaccessible (we cannot read its values), we 
> should not allow the renaming operation.
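
As a hedged illustration of the "throw for data source tables" behavior described above (the function, parameter, and message are illustrative assumptions, not Spark's actual implementation; Spark itself would raise an AnalysisException from within the command):

{code}
// Hypothetical guard sketch: reject alter-column operations on data source tables.
def checkAlterColumnSupported(tableName: String, isDatasourceTable: Boolean): Unit = {
  if (isDatasourceTable) {
    throw new UnsupportedOperationException(
      s"ALTER TABLE ... CHANGE/ADD/REPLACE COLUMNS is not supported for data source table $tableName")
  }
}
{code}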



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14351) Optimize ImpurityAggregator for decision trees

2016-04-02 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14351:
-

 Summary: Optimize ImpurityAggregator for decision trees
 Key: SPARK-14351
 URL: https://issues.apache.org/jira/browse/SPARK-14351
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


{{RandomForest.binsToBestSplit}} currently takes a large amount of time.  Based 
on some quick profiling, I believe a big chunk of this is spent in 
{{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array 
copies) and {{RandomForest.calculateImpurityStats}}.

This JIRA is for:
* Doing more profiling to confirm that unnecessary time is being spent in some 
of these methods.
* Optimizing the implementation
* Profiling again to confirm the speedups

Local profiling for large enough examples should suffice, especially since the 
optimizations should not need to change the amount of data communicated.
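
A hedged sketch of the kind of copy-avoidance being suggested (class and method names are illustrative, not MLlib's actual types): operate on the flat stats array through an offset view instead of copying a slice for every (node, feature, bin).

{code}
// Illustrative only: a view over a region of the flat stats array,
// so no per-lookup Array copy is needed.
final class StatsView(stats: Array[Double], offset: Int, length: Int) {
  def apply(i: Int): Double = stats(offset + i)
  def sum: Double = {
    var s = 0.0
    var i = 0
    while (i < length) { s += stats(offset + i); i += 1 }
    s
  }
}
{code}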



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14231) JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.

2016-04-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14231.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12030
[https://github.com/apache/spark/pull/12030]

> JSON data source fails to infer floats as decimal when precision is bigger 
> than 38 or scale is bigger than precision.
> -
>
> Key: SPARK-14231
> URL: https://issues.apache.org/jira/browse/SPARK-14231
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, JSON data source supports {{floatAsBigDecimal}} option, which 
> reads floats as {{DecimalType}}.
> I noticed there are several restrictions in Spark {{DecimalType}} below:
> 1. The precision cannot be bigger than 38.
> 2. scale cannot be bigger than precision. 
> However, with the option above, it reads {{BigDecimal}} which does not follow 
> the conditions above.
> This could be observed as below:
> {code}
> def simpleFloats: RDD[String] =
>   sqlContext.sparkContext.parallelize(
> """{"a": 0.01}""" ::
> """{"a": 0.02}""" :: Nil)
> val jsonDF = sqlContext.read
>   .option("floatAsBigDecimal", "true")
>   .json(simpleFloats)
> jsonDF.printSchema()
> {code}
> throws an exception below:
> {code}
> org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater 
> than precision (1).;
>   at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:59)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:57)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2249)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:55)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> ...
> {code}
> Since the JSON data source falls back to {{StringType}} when it fails to infer 
> a type, such values might have to be inferred as {{StringType}}, or perhaps 
> simply as {{DoubleType}}.
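
A hedged sketch of one possible fallback during inference (names are illustrative and this is not the actual code path): keep {{DecimalType}} only when the value satisfies Spark's constraints, otherwise fall back to {{DoubleType}}.

{code}
import org.apache.spark.sql.types.{DataType, DecimalType, DoubleType}

// Sketch: reject decimals that violate precision <= 38 or scale <= precision.
def inferDecimal(value: java.math.BigDecimal): DataType = {
  if (value.precision() > DecimalType.MAX_PRECISION || value.scale() > value.precision()) {
    DoubleType
  } else {
    DecimalType(value.precision(), value.scale())
  }
}
{code}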



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14231) JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.

2016-04-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14231:
---
Assignee: Hyukjin Kwon

> JSON data source fails to infer floats as decimal when precision is bigger 
> than 38 or scale is bigger than precision.
> -
>
> Key: SPARK-14231
> URL: https://issues.apache.org/jira/browse/SPARK-14231
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, JSON data source supports {{floatAsBigDecimal}} option, which 
> reads floats as {{DecimalType}}.
> I noticed there are several restrictions in Spark {{DecimalType}} below:
> 1. The precision cannot be bigger than 38.
> 2. scale cannot be bigger than precision. 
> However, with the option above, it reads {{BigDecimal}} which does not follow 
> the conditions above.
> This could be observed as below:
> {code}
> def simpleFloats: RDD[String] =
>   sqlContext.sparkContext.parallelize(
> """{"a": 0.01}""" ::
> """{"a": 0.02}""" :: Nil)
> val jsonDF = sqlContext.read
>   .option("floatAsBigDecimal", "true")
>   .json(simpleFloats)
> jsonDF.printSchema()
> {code}
> throws an exception below:
> {code}
> org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater 
> than precision (1).;
>   at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:59)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:57)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2249)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:55)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> ...
> {code}
> Since the JSON data source falls back to {{StringType}} when it fails to infer 
> a type, such values might have to be inferred as {{StringType}}, or perhaps 
> simply as {{DoubleType}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org