[jira] [Commented] (SPARK-20193) Selecting empty struct causes ExpressionEncoder error.
[ https://issues.apache.org/jira/browse/SPARK-20193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954584#comment-15954584 ] Liang-Chi Hsieh commented on SPARK-20193: - Actually I am not sure what {{struct()}} represents. If you want a null for this struct, you can write: {code} spark.range(3).select(col("id"), lit(null).cast(new StructType())) {code} > Selecting empty struct causes ExpressionEncoder error. > -- > > Key: SPARK-20193 > URL: https://issues.apache.org/jira/browse/SPARK-20193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu > Labels: struct > > {{def struct(cols: Column*): Column}} > Given the above signature and the lack of any note in the docs saying that a > struct with no columns is not supported, I would expect the following to work: > {{spark.range(3).select(col("id"), struct().as("empty_struct")).collect}} > However, this results in: > {quote} > java.lang.AssertionError: assertion failed: each serializer expression should > contains at least one `BoundReference` > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:240) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:238) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.<init>(ExpressionEncoder.scala:238) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:63) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at > 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2837) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1131) > ... 39 elided > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20207) Add ability to exclude current row in WindowSpec
Mathew Wicks created SPARK-20207: Summary: Add ability to exclude current row in WindowSpec Key: SPARK-20207 URL: https://issues.apache.org/jira/browse/SPARK-20207 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Mathew Wicks Priority: Minor It would be useful if we could implement a way to exclude the current row in WindowSpec. (We can currently only select ranges of rows/time.) Currently, users have to resort to ridiculous measures to exclude the current row from windowing aggregations, as seen here: http://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions/43198839#43198839
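The Stack Overflow workaround linked above boils down to aggregating over the full window and then subtracting the current row's own contribution. A minimal sketch of that logic in plain Scala collections (the names `Sale` and `sumOfOthers` are illustrative, not a Spark API):

```scala
// Each "partition" plays the role of one window partition.
case class Sale(region: String, amount: Double)

// Aggregate over the whole partition, then subtract the current row's value
// to get a per-row aggregate that "excludes the current row".
def sumOfOthers(partition: Seq[Sale]): Seq[(Sale, Double)] = {
  val total = partition.map(_.amount).sum           // full-window aggregate
  partition.map(s => (s, total - s.amount))         // minus the current row
}

val sales = Seq(Sale("us", 10.0), Sale("us", 20.0), Sale("us", 30.0))
val result = sumOfOthers(sales).map(_._2)           // Seq(50.0, 40.0, 30.0)
```

In Spark the same trick is done with a window aggregate minus the current row's column; a first-class "exclude current row" frame specifier would make that subtraction unnecessary.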
[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954559#comment-15954559 ] Liang-Chi Hsieh commented on SPARK-20144: - I don't think the API guarantees anything about data ordering. The difference between 1.6.3 and 2.0.2 is just due to a change of internal implementation. I checked the current FileSourceScanExec; it still reorders the partition files. When you save sorted data into Parquet, only the data within an individual Parquet file maintains its ordering. We shouldn't expect a particular ordering of the whole data read back if the API doesn't explicitly guarantee it. > spark.read.parquet no longer maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin > > Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > that when we read Parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > Parquet file was produced with. > This is because FileSourceStrategy.scala combines the Parquet files into > fewer partitions and also reorders them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so I'm not sure if this is an issue with > 2.1.
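Since the read API makes no ordering guarantee, the robust pattern is to persist an explicit ordering key alongside the data and sort on it after reading. A minimal sketch of the idea in plain Scala, where the shuffle stands in for a reader that recombines partition files in arbitrary order (identifiers here are illustrative):

```scala
import scala.util.Random

// Carry an explicit ordering key (here, a zipWithIndex position) in the data.
val original: Seq[(Long, String)] =
  Seq("a", "b", "c", "d").zipWithIndex.map { case (v, i) => (i.toLong, v) }

// Simulate a reader that returns rows in arbitrary, recombined order.
val readBack = Random.shuffle(original)

// Recover the intended order from the key rather than relying on file order.
val restored = readBack.sortBy(_._1).map(_._2)
```

In Spark terms this corresponds to writing an index column with the data and calling `orderBy` on it after `spark.read.parquet`, instead of assuming the rows come back in write order.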
[jira] [Updated] (SPARK-20079) Re-registration of AM hangs Spark cluster in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-20079: Description: The ExecutorAllocationManager.reset method is called when the AM re-registers, which sets the ExecutorAllocationManager.initializing field to true. While this field is true, the Driver does not start a new executor from the AM request. The following two cases set the field back to false: 1. An executor is idle for some time. 2. There are new stages to be submitted. After a stage has been submitted, if the AM is killed and restarted, neither case occurs: 1. When the AM is killed, YARN kills all running containers, so all executors are lost and no executor is ever idle. 2. With no surviving executors, the current stage can never complete, so the DAG scheduler never submits a new stage. Reproduction steps: 1. Start cluster {noformat} echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | ./bin/spark-shell --master yarn-client --executor-cores 1 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=2 {noformat} 2. Kill the AM process when a stage is scheduled. was: 1. Start cluster echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | ./bin/spark-shell --master yarn-client --executor-cores 1 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=2 2. Kill the AM process when a stage is scheduled. > Re-registration of AM hangs Spark cluster in yarn-client mode > - > > Key: SPARK-20079 > URL: https://issues.apache.org/jira/browse/SPARK-20079 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Guoqiang Li > > The ExecutorAllocationManager.reset method is called when the AM re-registers, > which sets the ExecutorAllocationManager.initializing field to true. While this > field is true, the Driver does not start a new executor from the AM request. > The following two cases set the field back to false: > 1. An executor is idle for some time. > 2. There are new stages to be submitted. > After a stage has been submitted, if the AM is killed and restarted, neither > case occurs: > 1. When the AM is killed, YARN kills all running containers, so all executors > are lost and no executor is ever idle. > 2. With no surviving executors, the current stage can never complete, so the > DAG scheduler never submits a new stage. > Reproduction steps: > 1. Start cluster > {noformat} > echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | > ./bin/spark-shell --master yarn-client --executor-cores 1 --conf > spark.shuffle.service.enabled=true --conf > spark.dynamicAllocation.enabled=true --conf > spark.dynamicAllocation.maxExecutors=2 > {noformat} > 2. Kill the AM process when a stage is scheduled.
[jira] [Commented] (SPARK-11421) Add the ability to add a jar to the current class loader
[ https://issues.apache.org/jira/browse/SPARK-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954533#comment-15954533 ] Daniel Erenrich commented on SPARK-11421: - Is this not basically a duplicate of the much older https://issues.apache.org/jira/browse/SPARK-5377 > Add the ability to add a jar to the current class loader > > > Key: SPARK-11421 > URL: https://issues.apache.org/jira/browse/SPARK-11421 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: holdenk >Priority: Minor > > addJar adds jars for future operations, but it could also add to the current > class loader; this would be really useful, most likely in Python & R, where > some included Python code may wish to add some jars.
[jira] [Commented] (SPARK-20176) Spark Dataframe UDAF issue
[ https://issues.apache.org/jira/browse/SPARK-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954527#comment-15954527 ] Dinesh Man Amatya commented on SPARK-20176: --- Thanks Kazuaki for the effort. I was able to resolve the issue by upgrading the spark and scala version as follows, scala.version : 2.11.5 scala.compat.version : 2.11 spark.version : 2.1.0 > Spark Dataframe UDAF issue > -- > > Key: SPARK-20176 > URL: https://issues.apache.org/jira/browse/SPARK-20176 > Project: Spark > Issue Type: IT Help > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Dinesh Man Amatya > > Getting following error in custom UDAF > Error while decoding: java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private Object[] values1; > /* 011 */ private org.apache.spark.sql.types.StructType schema; > /* 012 */ private org.apache.spark.sql.types.StructType schema1; > /* 013 */ > /* 014 */ > /* 015 */ public SpecificSafeProjection(Object[] references) { > /* 016 */ this.references = references; > /* 017 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 018 */ > /* 019 */ > /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) > references[1]; > /* 022 */ } > /* 023 */ > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ 
InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ values = new Object[2]; > /* 028 */ > /* 029 */ boolean isNull2 = i.isNullAt(0); > /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 031 */ > /* 032 */ boolean isNull1 = isNull2; > /* 033 */ final java.lang.String value1 = isNull1 ? null : > (java.lang.String) value2.toString(); > /* 034 */ isNull1 = value1 == null; > /* 035 */ if (isNull1) { > /* 036 */ values[0] = null; > /* 037 */ } else { > /* 038 */ values[0] = value1; > /* 039 */ } > /* 040 */ > /* 041 */ boolean isNull5 = i.isNullAt(1); > /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2)); > /* 043 */ boolean isNull3 = false; > /* 044 */ org.apache.spark.sql.Row value3 = null; > /* 045 */ if (!false && isNull5) { > /* 046 */ > /* 047 */ final org.apache.spark.sql.Row value6 = null; > /* 048 */ isNull3 = true; > /* 049 */ value3 = value6; > /* 050 */ } else { > /* 051 */ > /* 052 */ values1 = new Object[2]; > /* 053 */ > /* 054 */ boolean isNull10 = i.isNullAt(1); > /* 055 */ InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2)); > /* 056 */ > /* 057 */ boolean isNull9 = isNull10 || false; > /* 058 */ final boolean value9 = isNull9 ? false : (Boolean) > value10.isNullAt(0); > /* 059 */ boolean isNull8 = false; > /* 060 */ double value8 = -1.0; > /* 061 */ if (!isNull9 && value9) { > /* 062 */ > /* 063 */ final double value12 = -1.0; > /* 064 */ isNull8 = true; > /* 065 */ value8 = value12; > /* 066 */ } else { > /* 067 */ > /* 068 */ boolean isNull14 = i.isNullAt(1); > /* 069 */ InternalRow value14 = isNull14 ? 
null : (i.getStruct(1, 2)); > /* 070 */ boolean isNull13 = isNull14; > /* 071 */ double value13 = -1.0; > /* 072 */ > /* 073 */ if (!isNull14) { > /* 074 */ > /* 075 */ if (value14.isNullAt(0)) { > /* 076 */ isNull13 = true; > /* 077 */ } else { > /* 078 */ value13 = value14.getDouble(0); > /* 079 */ } > /* 080 */ > /* 081 */ } > /* 082 */ isNull8 = isNull13; > /* 083 */ value8 = value13; > /* 084 */ } > /* 085 */ if (isNull8) { > /* 086 */ values1[0] = null; > /* 087 */ } else { > /* 088 */ values1[0] = value8; > /* 089 */ } > /* 090 */ > /* 091 */ boolean isNull17 = i.isNullAt(1); > /* 092 */ InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2)); > /* 093 */ > /* 094 */ boolean isNull16 = isNull17 || false; > /* 095 */ final boolean value16 = isNull16 ? false : (Boolean) > value17.isNullAt(1); > /* 096 */ boolean
[jira] [Updated] (SPARK-20206) spark.ui.killEnabled=false property doesn't reflect on tasks/stages
[ https://issues.apache.org/jira/browse/SPARK-20206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] srinivasan updated SPARK-20206: --- Priority: Minor (was: Major) > spark.ui.killEnabled=false property doesn't reflect on tasks/stages > -- > > Key: SPARK-20206 > URL: https://issues.apache.org/jira/browse/SPARK-20206 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: srinivasan >Priority: Minor > > The spark.ui.killEnabled=false property doesn't reflect on active tasks and > stages: the kill hyperlink is still enabled on active tasks and stages.
[jira] [Created] (SPARK-20206) spark.ui.killEnabled=false property doesn't reflect on tasks/stages
srinivasan created SPARK-20206: -- Summary: spark.ui.killEnabled=false property doesn't reflect on tasks/stages Key: SPARK-20206 URL: https://issues.apache.org/jira/browse/SPARK-20206 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Reporter: srinivasan The spark.ui.killEnabled=false property doesn't reflect on active tasks and stages: the kill hyperlink is still enabled on active tasks and stages.
[jira] [Comment Edited] (SPARK-14726) Support for sampling when inferring schema in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954447#comment-15954447 ] Hyukjin Kwon edited comment on SPARK-14726 at 4/4/17 1:47 AM: -- Actually, after re-thinking, it seems we would not need this for now if not many users request this. Now we can do a workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema spark.read.schema(sampledSchema).csv(ds) {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. was (Author: hyukjin.kwon): Actually, after re-thinking, it seems we would not need this for now if not many users request this. Workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema spark.read.schema(sampledSchema).csv(ds) {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. > Support for sampling when inferring schema in CSV data source > - > > Key: SPARK-14726 > URL: https://issues.apache.org/jira/browse/SPARK-14726 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Bomi Kim > > Currently, I am using CSV data source and trying to get used to Spark 2.0 > because it has built-in CSV data source. > I realized that CSV data source infers schema with all the data. JSON data > source supports sampling ratio option. 
> It would be great if CSV data source has this option too (or is this > supported already?).
[jira] [Comment Edited] (SPARK-14726) Support for sampling when inferring schema in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954447#comment-15954447 ] Hyukjin Kwon edited comment on SPARK-14726 at 4/4/17 1:40 AM: -- Actually, after re-thinking, it seems we would not need this for now if not many users request this. Workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema spark.read.schema(sampledSchema).csv(ds) {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. was (Author: hyukjin.kwon): Actually, after re-thinking, it seems we would not need this for now if not many users request this. Workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS.sample(false, 0.7) val sampledSchema = spark.read.option("inferSchema", true).csv(ds).schema spark.read.schema(sampledSchema).csv("/tmp/path") {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. > Support for sampling when inferring schema in CSV data source > - > > Key: SPARK-14726 > URL: https://issues.apache.org/jira/browse/SPARK-14726 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Bomi Kim > > Currently, I am using CSV data source and trying to get used to Spark 2.0 > because it has built-in CSV data source. > I realized that CSV data source infers schema with all the data. JSON data > source supports sampling ratio option. 
> It would be great if CSV data source has this option too (or is this > supported already?).
[jira] [Resolved] (SPARK-14726) Support for sampling when inferring schema in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-14726. -- Resolution: Won't Fix Actually, after rethinking, it seems we would not need this for now if not many users request it. A workaround is as below: {code} val ds = Seq("a", "b", "c", "d").toDS.sample(false, 0.7) val sampledSchema = spark.read.option("inferSchema", true).csv(ds).schema spark.read.schema(sampledSchema).csv("/tmp/path") {code} This also allows more dynamic options, e.g., sampling with or without replacement, filtering, or even just limit 100. I will keep an eye on similar issues and reopen this if it seems many users want it. Please reopen this if you strongly feel this should be supported as an option, or if anyone else feels so. > Support for sampling when inferring schema in CSV data source > - > > Key: SPARK-14726 > URL: https://issues.apache.org/jira/browse/SPARK-14726 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Bomi Kim > > Currently, I am using CSV data source and trying to get used to Spark 2.0 > because it has built-in CSV data source. > I realized that CSV data source infers schema with all the data. JSON data > source supports a sampling ratio option. > It would be great if CSV data source had this option too (or is this > supported already?).
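As a toy illustration of why the sampling workaround above helps (this is not Spark's actual inference code, and the simplified typing rules in `inferType` are my own assumption): schema inference only needs to scan enough rows to settle on a type, so inferring from a random sample and then applying the inferred schema to the full data avoids a full scan.

```scala
import scala.util.{Random, Try}

// Hypothetical, simplified type-inference rules over string values.
def inferType(sample: Seq[String]): String =
  if (sample.forall(v => Try(v.toLong).isSuccess)) "bigint"
  else if (sample.forall(v => Try(v.toDouble).isSuccess)) "double"
  else "string"

val fullColumn = (1 to 1000).map(_.toString)       // all values are integral
val sampled = Random.shuffle(fullColumn).take(100)  // infer from a sample only
val inferred = inferType(sampled)                   // "bigint"
```

The workaround in the comment does exactly this at the DataFrame level: `sample(false, 0.7)` bounds the inference scan, and the resulting schema is then applied to the full input.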
[jira] [Resolved] (SPARK-19186) Hash symbol in middle of Sybase database table name causes Spark Exception
[ https://issues.apache.org/jira/browse/SPARK-19186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19186. -- Resolution: Not A Problem ^ I agree with this. Also, to my knowledge, we can deal with the dialect as part of SPARK-17614, assuming the exception came from https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L60-L62 within Spark. I am resolving this per the issue described in this JIRA. Please reopen this if I misunderstood. > Hash symbol in middle of Sybase database table name causes Spark Exception > -- > > Key: SPARK-19186 > URL: https://issues.apache.org/jira/browse/SPARK-19186 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Adrian Schulewitz >Priority: Minor > > If I use a table name without a '#' symbol in the middle then no exception > occurs, but with one an exception is thrown. According to Sybase 15 > documentation a '#' is a legal character. > val testSql = "SELECT * FROM CTP#ADR_TYPE_DBF" > val conf = new SparkConf().setAppName("MUREX DMart Simple Reader via > SQL").setMaster("local[2]") > val sess = SparkSession > .builder() > .appName("MUREX DMart Simple SQL Reader") > .config(conf) > .getOrCreate() > import sess.implicits._ > val df = sess.read > .format("jdbc") > .option("url", > "jdbc:jtds:sybase://auq7064s.unix.anz:4020/mxdmart56") > .option("driver", "net.sourceforge.jtds.jdbc.Driver") > .option("dbtable", "CTP#ADR_TYPE_DBF") > .option("UDT_DEALCRD_REP", "mxdmart56") > .option("user", "INSTAL") > .option("password", "INSTALL") > .load() > df.createOrReplaceTempView("trades") > val resultsDF = sess.sql(testSql) > resultsDF.show() > 17/01/12 14:30:01 INFO SharedState: Warehouse path is > 'file:/C:/DEVELOPMENT/Projects/MUREX/trunk/murex-eom-reporting/spark-warehouse/'. 
> 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: trades > 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: SELECT * FROM > CTP#ADR_TYPE_DBF > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '#' expecting {, ',', 'SELECT', 'FROM', 'ADD', 'AS', > 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', > 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', > 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', > 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', > 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', > 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', > 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', > 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', > 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', > 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', > 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', > 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', > 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', > 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', > 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', > 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', > 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', > 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 
'DFS', 'TRUNCATE', > 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', > 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', > 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', > 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', > 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, > BACKQUOTED_IDENTIFIER}(line 1, pos 17) > == SQL == > SELECT * FROM CTP#ADR_TYPE_DBF > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at >
[jira] [Resolved] (SPARK-10364) Support Parquet logical type TIMESTAMP_MILLIS
[ https://issues.apache.org/jira/browse/SPARK-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-10364. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15332 [https://github.com/apache/spark/pull/15332] > Support Parquet logical type TIMESTAMP_MILLIS > - > > Key: SPARK-10364 > URL: https://issues.apache.org/jira/browse/SPARK-10364 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian > Fix For: 2.2.0 > > > The {{TimestampType}} in Spark SQL is of microsecond precision. Ideally, we > should convert Spark SQL timestamp values into Parquet {{TIMESTAMP_MICROS}}, > but unfortunately parquet-mr hasn't supported it yet. > For the read path, we should be able to read {{TIMESTAMP_MILLIS}} Parquet > values and pad a 0 microsecond part onto the read values. > For the write path, currently we are writing timestamps as {{INT96}}, similar > to Impala and Hive. One alternative is that we can have a separate SQL > option to let users write Spark SQL timestamp values as > {{TIMESTAMP_MILLIS}}. Of course, in this way the microsecond part will be > truncated.
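The read- and write-path conversions described in the issue are plain scale changes between milliseconds and Spark SQL's microsecond precision. A sketch with my own helper names (not Spark's internals; pre-epoch values would need floor semantics, which this sketch ignores):

```scala
// Read path: widen a TIMESTAMP_MILLIS value to microsecond precision by
// padding a zero microsecond part.
def millisToMicros(millis: Long): Long = millis * 1000L

// Write path (TIMESTAMP_MILLIS option): the sub-millisecond part of the
// microsecond timestamp is truncated.
def microsToMillis(micros: Long): Long = micros / 1000L
```

Round-tripping through the write path therefore loses the sub-millisecond remainder: `millisToMicros(microsToMillis(1234567L))` yields `1234000L`, which is the truncation the issue calls out.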
[jira] [Updated] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19408: Description: In SPARK-17075, we estimate cardinality of predicate expression "column (op) literal", where op is =, <, <=, >, >= or <=>. In SQL queries, we also see predicate expressions involving two columns such as "column-1 (op) column-2" where column-1 and column-2 belong to same table. Note that, if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work. In this jira, we want to estimate the filter factor of predicate expressions involving two columns of same table. For example, multiple tpc-h queries have this kind of predicate "WHERE l_commitdate < l_receiptdate". was: In SPARK-17075, we estimate cardinality of predicate expression "column (op) literal", where op is =, <, <=, >, or >=. In SQL queries, we also see predicate expressions involving two columns such as "column-1 (op) column-2" where column-1 and column-2 belong to same table. Note that, if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work. In this jira, we want to estimate the filter factor of predicate expressions involving two columns of same table. For example, multiple tpc-h queries have this kind of predicate "WHERE l_commitdate < l_receiptdate". > cardinality estimation involving two columns of the same table > -- > > Key: SPARK-19408 > URL: https://issues.apache.org/jira/browse/SPARK-19408 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu > Fix For: 2.2.0 > > > In SPARK-17075, we estimate cardinality of predicate expression "column (op) > literal", where op is =, <, <=, >, >= or <=>. In SQL queries, we also see > predicate expressions involving two columns such as "column-1 (op) column-2" > where column-1 and column-2 belong to same table. 
Note that, if column-1 and > column-2 belong to different tables, then it is a join operator's work, NOT a > filter operator's work. > In this jira, we want to estimate the filter factor of predicate expressions > involving two columns of the same table. For example, multiple tpc-h queries > have this kind of predicate "WHERE l_commitdate < l_receiptdate".
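For intuition only, here is a toy selectivity estimate for "colA < colB" built from per-column min/max statistics, assuming uniformly distributed, independent values. This is my own sketch, not the estimator proposed in this JIRA: disjoint ranges give an exact answer, and overlapping ranges fall back to a default guess, as simple estimators often do.

```scala
// Estimate the fraction of rows satisfying colA < colB from range statistics.
def selectivityLessThan(minA: Double, maxA: Double,
                        minB: Double, maxB: Double): Double =
  if (maxA < minB) 1.0        // every A value lies below every B value
  else if (minA >= maxB) 0.0  // no A value can lie below any B value
  else 0.5                    // ranges overlap: default estimate

// e.g. "WHERE l_commitdate < l_receiptdate" with overlapping date ranges
// would fall into the default-estimate branch.
```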
[jira] [Resolved] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19408. - Resolution: Fixed Assignee: Ron Hu Fix Version/s: 2.2.0 > cardinality estimation involving two columns of the same table > -- > > Key: SPARK-19408 > URL: https://issues.apache.org/jira/browse/SPARK-19408 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu >Assignee: Ron Hu > Fix For: 2.2.0 > > > In SPARK-17075, we estimate cardinality of predicate expression "column (op) > literal", where op is =, <, <=, >, >= or <=>. In SQL queries, we also see > predicate expressions involving two columns such as "column-1 (op) column-2" > where column-1 and column-2 belong to the same table. Note that, if column-1 and > column-2 belong to different tables, then it is a join operator's work, NOT a > filter operator's work. > In this jira, we want to estimate the filter factor of predicate expressions > involving two columns of the same table. For example, multiple tpc-h queries > have this kind of predicate "WHERE l_commitdate < l_receiptdate".
[jira] [Resolved] (SPARK-20145) "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't
[ https://issues.apache.org/jira/browse/SPARK-20145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20145. - Resolution: Fixed Assignee: sam elamin Fix Version/s: 2.2.0 > "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't > > > Key: SPARK-20145 > URL: https://issues.apache.org/jira/browse/SPARK-20145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Juliusz Sompolski >Assignee: sam elamin > Fix For: 2.2.0 > > > Executed at clean tip of the master branch, with all default settings: > scala> spark.sql("SELECT * FROM range(1)") > res1: org.apache.spark.sql.DataFrame = [id: bigint] > scala> spark.sql("SELECT * FROM RANGE(1)") > org.apache.spark.sql.AnalysisException: could not resolve `RANGE` to a > table-valued function; line 1 pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:126) > at > org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:106) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > ... > I believe it should be case insensitive?
[jira] [Comment Edited] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954387#comment-15954387 ] Mridul Muralidharan edited comment on SPARK-20205 at 4/4/17 12:15 AM: -- For the history server that will fail - good point. At least for custom listeners, users can work around it until the next release by using the current time (in their code, when the field submissionTime is None). Thanks for clarifying [~vanzin] ! was (Author: mridulm80): For the history server that will fail - good point. At least for custom listeners, users can work around it until the next release by using the current time. Thanks for clarifying [~vanzin] ! > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time.
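The workaround discussed above, having a custom listener substitute the current wall-clock time when submissionTime is unset, can be sketched in plain Python (StageInfo here is a hypothetical stand-in for the object a Scala SparkListener receives; in Spark the real field is stage.latestInfo.submissionTime, an Option[Long] in milliseconds):

```python
import time

# Hypothetical stand-in for the StageInfo a listener receives; in Spark
# submissionTime is an Option[Long] of epoch milliseconds.
class StageInfo:
    def __init__(self, submission_time=None):
        self.submission_time = submission_time

def effective_submission_time(info):
    # Workaround: if the scheduler posted the event before setting the
    # submission time, fall back to the listener's current wall clock.
    if info.submission_time is not None:
        return info.submission_time
    return int(time.time() * 1000)
```

This is only an approximation for live listeners; as noted above it does not help the history server, which replays events long after the fact.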
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954387#comment-15954387 ] Mridul Muralidharan commented on SPARK-20205: - For the history server that will fail - good point. At least for custom listeners, users can work around it until the next release by using the current time. Thanks for clarifying [~vanzin] ! > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time.
[jira] [Resolved] (SPARK-18893) Not support "alter table .. add columns .."
[ https://issues.apache.org/jira/browse/SPARK-18893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18893. - Resolution: Fixed Fix Version/s: 2.2.0 > Not support "alter table .. add columns .." > > > Key: SPARK-18893 > URL: https://issues.apache.org/jira/browse/SPARK-18893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: zuotingbing > Fix For: 2.2.0 > > > When we updated Spark from version 1.5.2 to 2.0.1, all of our cases that change a table using "alter table add columns" failed, even though the official documentation says "All Hive DDL Functions, including: alter table" are supported: http://spark.apache.org/docs/latest/sql-programming-guide.html. > Is there any plan to support SQL "alter table .. add/replace columns"?
[jira] [Commented] (SPARK-18893) Not support "alter table .. add columns .."
[ https://issues.apache.org/jira/browse/SPARK-18893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954375#comment-15954375 ] Wenchen Fan commented on SPARK-18893: - https://issues.apache.org/jira/browse/SPARK-19261 > Not support "alter table .. add columns .." > > > Key: SPARK-18893 > URL: https://issues.apache.org/jira/browse/SPARK-18893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: zuotingbing > > When we updated Spark from version 1.5.2 to 2.0.1, all of our cases that change a table using "alter table add columns" failed, even though the official documentation says "All Hive DDL Functions, including: alter table" are supported: http://spark.apache.org/docs/latest/sql-programming-guide.html. > Is there any plan to support SQL "alter table .. add/replace columns"?
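For reference, the DDL being requested has well-established semantics in other SQL engines: new columns are appended to the schema and existing rows read back NULL for them. The sketch below demonstrates this with Python's built-in sqlite3 module (which supports the single-column ADD COLUMN form) purely as an illustration; it is not Spark, whose support is tracked in SPARK-19261:

```python
import sqlite3

# Illustrating "alter table .. add columns .." semantics with sqlite3 as a
# stand-in engine (NOT Spark): the new column is appended to the schema and
# pre-existing rows return NULL for it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.execute("ALTER TABLE t ADD COLUMN note TEXT")

# PRAGMA table_info rows are (cid, name, type, ...); collect column names.
cols = [row[1] for row in conn.execute("PRAGMA table_info(t)")]
existing_row_value = conn.execute("SELECT note FROM t").fetchone()
```

The same user-visible contract (appended column, NULLs for old rows) is what the Spark 2.2 fix provides for Hive tables.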
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954357#comment-15954357 ] Marcelo Vanzin commented on SPARK-20205: bq. I was referring to the case where we are persisting to event log or consuming events to externally persist them. I see. In that case I believe it will always be unset. For live listeners, current time is a good enough approximation, but for the history server, for example, that's not an option (since {{SparkListenerStageSubmitted}} does not have a {{time}} field). > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954348#comment-15954348 ] Mridul Muralidharan commented on SPARK-20205: - bq. I wouldn't say incorrect; at worst it's gonna be slightly inaccurate. I was referring to the case where we are persisting to event log or consuming events to externally persist them. In this context, will we always have unspecified submissionTime or is there case where submissionTime is pointing to some incorrect/spurious value (if this is always in the codepath after makeNewStageAttempt; then it should be fine). Essentially, is the workaround for existing spark versions to simply set submissionTime to current time if it is None for SparkListenerStageSubmitted sufficient ? Will it miss some corner case ? (value is set but is incorrect ?) > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. 
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954340#comment-15954340 ] Marcelo Vanzin commented on SPARK-20205: bq. This is nasty ! This means submissionTime will always be unset ? Well, it's a little more complicated than that. The UI code currently "self heals", because it just keeps a pointer to the {{StageInfo}} object which is modified by the scheduler later. So eventually the UI sees the value. But the event log, for example, might not have the submission time. bq. Btw, is it possible for submissionTime to be set - but to an incorrect value ? I wouldn't say incorrect; at worst it's gonna be slightly inaccurate. > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. 
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954333#comment-15954333 ] Mridul Muralidharan commented on SPARK-20205: - This is nasty ! This means submissionTime will always be unset ? Btw, is it possible for submissionTime to be set - but to an incorrect value ? > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints
[ https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954312#comment-15954312 ] Kamal Gurala commented on SPARK-4899: - Some performance related concerns https://github.com/apache/spark/pull/60#r16817226 > Support Mesos features: roles and checkpoints > - > > Key: SPARK-4899 > URL: https://issues.apache.org/jira/browse/SPARK-4899 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 1.2.0 >Reporter: Andrew Ash > > Inspired by https://github.com/apache/spark/pull/60 > Mesos has two features that would be nice for Spark to take advantage of: > 1. Roles -- a way to specify ACLs and priorities for users > 2. Checkpoints -- a way to restart a failed Mesos slave without losing all > the work that was happening on the box > Some of these may require a Mesos upgrade past our current 0.18.1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application
[ https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954233#comment-15954233 ] Steve Loughran edited comment on SPARK-20153 at 4/3/17 10:13 PM: - This is fixed in Hadoop 2.8 with [per-bucket configuration|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]; HADOOP-13336. I would *really* advise against trying to re-implement this in Spark: having one consistent model for configuring s3a bindings everywhere matters, as there are a lot more options than just credentials; the S3 endpoint is a critical one when trying to work with V4 auth endpoints. As a temporary workaround, one which will leak your secrets to logs, know that you can use s3a://key:secret@bucket, URL-encoding the secret, and so get access. Once you use this, consider all logs sensitive data. was (Author: ste...@apache.org): This is fixed in Hadoop 2.8 with [per-bucket configuration|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]; HADOOP-13336. I would *really* advise against trying to re-implement this in spark as having one consistent model for configuring s3a bindings everywhere will be the only way to debug what's going on, especially given that for security reasons you can't log what's going on. As a temporary workaround, one which will leak your secrets to logs, know that you can go s3a://key:secret@bucket, URL encoding the secret.
> Support Multiple aws credentials in order to access multiple Hive on S3 table > in spark application > --- > > Key: SPARK-20153 > URL: https://issues.apache.org/jira/browse/SPARK-20153 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Franck Tago >Priority: Minor > > I need to access multiple Hive tables in my Spark application, where each Hive table is > 1- an external table with data sitting on S3 > 2- owned by a different AWS user, so I need to provide different AWS credentials. > I am familiar with setting the AWS credentials in the Hadoop configuration object, but that does not really help me because I can only set one pair of (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey). > From my research, there is no easy or elegant way to do this in Spark. Why is that? > How do I address this use case?
[jira] [Commented] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application
[ https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954233#comment-15954233 ] Steve Loughran commented on SPARK-20153: This is fixed in Hadoop 2.8 with [per-bucket configuration|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]; HADOOP-13336. I would *really* advise against trying to re-implement this in Spark, as having one consistent model for configuring s3a bindings everywhere will be the only way to debug what's going on, especially given that for security reasons you can't log what's going on. As a temporary workaround, one which will leak your secrets to logs, know that you can use s3a://key:secret@bucket, URL-encoding the secret. > Support Multiple aws credentials in order to access multiple Hive on S3 table > in spark application > --- > > Key: SPARK-20153 > URL: https://issues.apache.org/jira/browse/SPARK-20153 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Franck Tago >Priority: Minor > > I need to access multiple Hive tables in my Spark application, where each Hive table is > 1- an external table with data sitting on S3 > 2- owned by a different AWS user, so I need to provide different AWS credentials. > I am familiar with setting the AWS credentials in the Hadoop configuration object, but that does not really help me because I can only set one pair of (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey). > From my research, there is no easy or elegant way to do this in Spark. Why is that? > How do I address this use case?
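Both options in the comment above can be made concrete. Hadoop 2.8 per-bucket keys take the form fs.s3a.bucket.BUCKET.access.key / .secret.key; the inline-URL workaround requires URL-encoding the secret (and, as warned, leaks it to logs). A small sketch with made-up bucket names and credentials:

```python
from urllib.parse import quote

# Hadoop 2.8+ per-bucket configuration: each bucket carries its own
# credentials. Bucket names and key values here are entirely made up.
hadoop_conf = {
    "fs.s3a.bucket.sales.access.key": "AKIAEXAMPLE1",
    "fs.s3a.bucket.sales.secret.key": "secret/one+x",
    "fs.s3a.bucket.hr.access.key": "AKIAEXAMPLE2",
    "fs.s3a.bucket.hr.secret.key": "secret/two",
}

def inline_s3a_url(bucket, access_key, secret_key, path):
    # Temporary workaround only: embeds credentials in the URL, which WILL
    # leak into logs. The secret must be URL-encoded; '/' and '+' are the
    # usual offenders in AWS secret keys.
    return f"s3a://{access_key}:{quote(secret_key, safe='')}@{bucket}/{path}"
```

The per-bucket route is the sustainable one; the inline URL only buys time until Hadoop 2.8 is available.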
[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints
[ https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954212#comment-15954212 ] Charles Allen commented on SPARK-4899: -- It was discussed on the mailing list with [~timchen] that checkpointing might just need a timeout setting available to the other schedulers. > Support Mesos features: roles and checkpoints > - > > Key: SPARK-4899 > URL: https://issues.apache.org/jira/browse/SPARK-4899 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 1.2.0 >Reporter: Andrew Ash > > Inspired by https://github.com/apache/spark/pull/60 > Mesos has two features that would be nice for Spark to take advantage of: > 1. Roles -- a way to specify ACLs and priorities for users > 2. Checkpoints -- a way to restart a failed Mesos slave without losing all > the work that was happening on the box > Some of these may require a Mesos upgrade past our current 0.18.1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20064) Bump the PySpark version number to 2.2
[ https://issues.apache.org/jira/browse/SPARK-20064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20064: Assignee: (was: Apache Spark) > Bump the PySpark version number to 2.2 > -- > > Key: SPARK-20064 > URL: https://issues.apache.org/jira/browse/SPARK-20064 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: holdenk >Priority: Minor > Labels: starter > > The version.py file should be updated for the new version. Note: this isn't critical, since for any releases made with make-distribution the version number is read from the XML, but if anyone builds from source and manually looks at the version number it would be good to have it match. This is a good starter issue, but something we should do quickly.
[jira] [Assigned] (SPARK-20064) Bump the PySpark version number to 2.2
[ https://issues.apache.org/jira/browse/SPARK-20064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20064: Assignee: Apache Spark > Bump the PySpark version number to 2.2 > -- > > Key: SPARK-20064 > URL: https://issues.apache.org/jira/browse/SPARK-20064 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > The version.py file should be updated for the new version. Note: this isn't critical, since for any releases made with make-distribution the version number is read from the XML, but if anyone builds from source and manually looks at the version number it would be good to have it match. This is a good starter issue, but something we should do quickly.
[jira] [Commented] (SPARK-20064) Bump the PySpark version number to 2.2
[ https://issues.apache.org/jira/browse/SPARK-20064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954186#comment-15954186 ] Apache Spark commented on SPARK-20064: -- User 'setjet' has created a pull request for this issue: https://github.com/apache/spark/pull/17523 > Bump the PySpark version number to 2.2 > -- > > Key: SPARK-20064 > URL: https://issues.apache.org/jira/browse/SPARK-20064 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: holdenk >Priority: Minor > Labels: starter > > The version.py file should be updated for the new version. Note: this isn't critical, since for any releases made with make-distribution the version number is read from the XML, but if anyone builds from source and manually looks at the version number it would be good to have it match. This is a good starter issue, but something we should do quickly.
[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints
[ https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954170#comment-15954170 ] Charles Allen commented on SPARK-4899: -- {{org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils#createSchedulerDriver}} seems to allow checkpointing, which only {{org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler}} uses. Neither the fine grained nor coarse grained schedulers use it, is there a reason for that? > Support Mesos features: roles and checkpoints > - > > Key: SPARK-4899 > URL: https://issues.apache.org/jira/browse/SPARK-4899 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 1.2.0 >Reporter: Andrew Ash > > Inspired by https://github.com/apache/spark/pull/60 > Mesos has two features that would be nice for Spark to take advantage of: > 1. Roles -- a way to specify ACLs and priorities for users > 2. Checkpoints -- a way to restart a failed Mesos slave without losing all > the work that was happening on the box > Some of these may require a Mesos upgrade past our current 0.18.1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
Marcelo Vanzin created SPARK-20205: -- Summary: DAGScheduler posts SparkListenerStageSubmitted before updating stage Key: SPARK-20205 URL: https://issues.apache.org/jira/browse/SPARK-20205 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Marcelo Vanzin Probably affects other versions, haven't checked. The code that submits the event to the bus is around line 991: {code} stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq) listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties)) {code} Later in the same method, the stage information is updated (around line 1057): {code} if (tasks.size > 0) { logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " + s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") taskScheduler.submitTasks(new TaskSet( tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties)) stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) {code} That means an event handler might get a stage submitted event with an unset submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
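The ordering problem described above can be reproduced in miniature, without Spark: a listener that copies the field at post time (like the event log writer) captures None, while a listener that keeps a reference to the mutable object later sees the value once the scheduler "heals" it. The bus and StageInfo below are toy stand-ins, not Spark classes:

```python
import time

# Toy stand-in for Spark's StageInfo; submission_time starts unset.
class StageInfo:
    def __init__(self):
        self.submission_time = None

captured_at_post = []   # simulates the event log: snapshots at post time
live_reference = []     # simulates the UI: keeps the mutable object

def post(listeners, info):
    # Toy listener bus: deliver the event synchronously to each listener.
    for listener in listeners:
        listener(info)

info = StageInfo()
post([lambda i: captured_at_post.append(i.submission_time),
      lambda i: live_reference.append(i)],
     info)                                       # event posted first...
info.submission_time = int(time.time() * 1000)   # ...field set afterwards
```

After this runs, the snapshot holds None while the live reference sees the later value, which matches the "self-healing UI vs. incomplete event log" behavior discussed in the comments.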
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954132#comment-15954132 ] Apache Spark commented on SPARK-18278: -- User 'foxish' has created a pull request for this issue: https://github.com/apache/spark/pull/17522 > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20176) Spark Dataframe UDAF issue
[ https://issues.apache.org/jira/browse/SPARK-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954093#comment-15954093 ] Kazuaki Ishizaki commented on SPARK-20176: -- Thanks. The code seem to work for the master. I am investigating which change fixes the issue. > Spark Dataframe UDAF issue > -- > > Key: SPARK-20176 > URL: https://issues.apache.org/jira/browse/SPARK-20176 > Project: Spark > Issue Type: IT Help > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Dinesh Man Amatya > > Getting following error in custom UDAF > Error while decoding: java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private Object[] values1; > /* 011 */ private org.apache.spark.sql.types.StructType schema; > /* 012 */ private org.apache.spark.sql.types.StructType schema1; > /* 013 */ > /* 014 */ > /* 015 */ public SpecificSafeProjection(Object[] references) { > /* 016 */ this.references = references; > /* 017 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 018 */ > /* 019 */ > /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) > references[1]; > /* 022 */ } > /* 023 */ > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ values = new Object[2]; > /* 028 */ > 
/* 029 */ boolean isNull2 = i.isNullAt(0); > /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 031 */ > /* 032 */ boolean isNull1 = isNull2; > /* 033 */ final java.lang.String value1 = isNull1 ? null : > (java.lang.String) value2.toString(); > /* 034 */ isNull1 = value1 == null; > /* 035 */ if (isNull1) { > /* 036 */ values[0] = null; > /* 037 */ } else { > /* 038 */ values[0] = value1; > /* 039 */ } > /* 040 */ > /* 041 */ boolean isNull5 = i.isNullAt(1); > /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2)); > /* 043 */ boolean isNull3 = false; > /* 044 */ org.apache.spark.sql.Row value3 = null; > /* 045 */ if (!false && isNull5) { > /* 046 */ > /* 047 */ final org.apache.spark.sql.Row value6 = null; > /* 048 */ isNull3 = true; > /* 049 */ value3 = value6; > /* 050 */ } else { > /* 051 */ > /* 052 */ values1 = new Object[2]; > /* 053 */ > /* 054 */ boolean isNull10 = i.isNullAt(1); > /* 055 */ InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2)); > /* 056 */ > /* 057 */ boolean isNull9 = isNull10 || false; > /* 058 */ final boolean value9 = isNull9 ? false : (Boolean) > value10.isNullAt(0); > /* 059 */ boolean isNull8 = false; > /* 060 */ double value8 = -1.0; > /* 061 */ if (!isNull9 && value9) { > /* 062 */ > /* 063 */ final double value12 = -1.0; > /* 064 */ isNull8 = true; > /* 065 */ value8 = value12; > /* 066 */ } else { > /* 067 */ > /* 068 */ boolean isNull14 = i.isNullAt(1); > /* 069 */ InternalRow value14 = isNull14 ? 
null : (i.getStruct(1, 2)); > /* 070 */ boolean isNull13 = isNull14; > /* 071 */ double value13 = -1.0; > /* 072 */ > /* 073 */ if (!isNull14) { > /* 074 */ > /* 075 */ if (value14.isNullAt(0)) { > /* 076 */ isNull13 = true; > /* 077 */ } else { > /* 078 */ value13 = value14.getDouble(0); > /* 079 */ } > /* 080 */ > /* 081 */ } > /* 082 */ isNull8 = isNull13; > /* 083 */ value8 = value13; > /* 084 */ } > /* 085 */ if (isNull8) { > /* 086 */ values1[0] = null; > /* 087 */ } else { > /* 088 */ values1[0] = value8; > /* 089 */ } > /* 090 */ > /* 091 */ boolean isNull17 = i.isNullAt(1); > /* 092 */ InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2)); > /* 093 */ > /* 094 */ boolean isNull16 = isNull17 || false; > /* 095 */ final boolean value16 = isNull16 ? false : (Boolean) > value17.isNullAt(1); > /* 096 */ boolean isNull15 = false; > /* 097 */ double value15 = -1.0; > /* 098 */ if (!isNull16 && value16) { >
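The compile error quoted above comes from generated line 58, where a conditional expression mixes a primitive {{boolean}} branch with a boxed {{java.lang.Boolean}} branch. A minimal standalone sketch of that pattern (the class name is hypothetical; javac accepts this via an unboxing conversion, while the embedded Janino compiler used by Spark's codegen at the affected version reportedly rejected it with the "Incompatible expression types" message):

```java
// Sketch of the pattern at generated line 58:
//   final boolean value9 = isNull9 ? false : (Boolean) value10.isNullAt(0);
// One branch is a primitive boolean, the other a boxed java.lang.Boolean.
// javac inserts an unboxing conversion; stricter runtime compilers such as
// Janino may reject the mixed-type conditional instead.
public class TernaryBoxingDemo {
    public static void main(String[] args) {
        Boolean boxed = Boolean.TRUE; // stands in for value10.isNullAt(0)
        boolean isNull = false;       // stands in for isNull9
        // javac unboxes the Boolean branch to a primitive boolean here.
        final boolean value = isNull ? false : (Boolean) boxed;
        System.out.println(value);
    }
}
```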
[jira] [Comment Edited] (SPARK-20176) Spark Dataframe UDAF issue
[ https://issues.apache.org/jira/browse/SPARK-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954093#comment-15954093 ] Kazuaki Ishizaki edited comment on SPARK-20176 at 4/3/17 8:13 PM: -- Thanks. The code seems to work on the master branch. I am investigating which change fixed the issue. was (Author: kiszk): Thanks. The code seem to work for the master. I am investigating which change fixes the issue. > Spark Dataframe UDAF issue > -- > > Key: SPARK-20176 > URL: https://issues.apache.org/jira/browse/SPARK-20176 > Project: Spark > Issue Type: IT Help > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Dinesh Man Amatya > > Getting the following error in a custom UDAF: > Error while decoding: java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private Object[] values1; > /* 011 */ private org.apache.spark.sql.types.StructType schema; > /* 012 */ private org.apache.spark.sql.types.StructType schema1; > /* 013 */ > /* 014 */ > /* 015 */ public SpecificSafeProjection(Object[] references) { > /* 016 */ this.references = references; > /* 017 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 018 */ > /* 019 */ > /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) > references[1]; > /* 022 */ } > /* 023 */ > /* 024 */ public 
java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ values = new Object[2]; > /* 028 */ > /* 029 */ boolean isNull2 = i.isNullAt(0); > /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 031 */ > /* 032 */ boolean isNull1 = isNull2; > /* 033 */ final java.lang.String value1 = isNull1 ? null : > (java.lang.String) value2.toString(); > /* 034 */ isNull1 = value1 == null; > /* 035 */ if (isNull1) { > /* 036 */ values[0] = null; > /* 037 */ } else { > /* 038 */ values[0] = value1; > /* 039 */ } > /* 040 */ > /* 041 */ boolean isNull5 = i.isNullAt(1); > /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2)); > /* 043 */ boolean isNull3 = false; > /* 044 */ org.apache.spark.sql.Row value3 = null; > /* 045 */ if (!false && isNull5) { > /* 046 */ > /* 047 */ final org.apache.spark.sql.Row value6 = null; > /* 048 */ isNull3 = true; > /* 049 */ value3 = value6; > /* 050 */ } else { > /* 051 */ > /* 052 */ values1 = new Object[2]; > /* 053 */ > /* 054 */ boolean isNull10 = i.isNullAt(1); > /* 055 */ InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2)); > /* 056 */ > /* 057 */ boolean isNull9 = isNull10 || false; > /* 058 */ final boolean value9 = isNull9 ? false : (Boolean) > value10.isNullAt(0); > /* 059 */ boolean isNull8 = false; > /* 060 */ double value8 = -1.0; > /* 061 */ if (!isNull9 && value9) { > /* 062 */ > /* 063 */ final double value12 = -1.0; > /* 064 */ isNull8 = true; > /* 065 */ value8 = value12; > /* 066 */ } else { > /* 067 */ > /* 068 */ boolean isNull14 = i.isNullAt(1); > /* 069 */ InternalRow value14 = isNull14 ? 
null : (i.getStruct(1, 2)); > /* 070 */ boolean isNull13 = isNull14; > /* 071 */ double value13 = -1.0; > /* 072 */ > /* 073 */ if (!isNull14) { > /* 074 */ > /* 075 */ if (value14.isNullAt(0)) { > /* 076 */ isNull13 = true; > /* 077 */ } else { > /* 078 */ value13 = value14.getDouble(0); > /* 079 */ } > /* 080 */ > /* 081 */ } > /* 082 */ isNull8 = isNull13; > /* 083 */ value8 = value13; > /* 084 */ } > /* 085 */ if (isNull8) { > /* 086 */ values1[0] = null; > /* 087 */ } else { > /* 088 */ values1[0] = value8; > /* 089 */ } > /* 090 */ > /* 091 */ boolean isNull17 = i.isNullAt(1); > /* 092 */ InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2)); > /* 093 */ > /* 094 */ boolean isNull16 = isNull17 || false; > /* 095 */ final boolean value16 = isNull16 ? false :
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953968#comment-15953968 ] Wenchen Fan commented on SPARK-19659: - What's the smallest unit of fetching remote shuffle blocks? If the unit is a block, I think it's really hard to avoid OOM entirely: if the estimated block size is wrong, fetching this block may cause OOM and we can do nothing about it. (I guess that's why you added {{spark.reducer.maxBytesShuffleToMemory}} in your PR.) If the unit can be smaller, like a byte buffer, and we can fully track and control the shuffle fetch memory usage, I think we can then solve the OOM problem pretty well without introducing a new config for users. Is it possible to do this with some advanced Netty API? > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory (off-heap by default) on > shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can > be large in skew situations. If OOM happens during shuffle read, the job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more > memory can resolve the OOM. However, this approach is not well suited to a > production environment, especially a data warehouse. > Using Spark SQL as the data engine in a warehouse, users hope to have a unified > parameter (e.g. memory) with less resource wasted (resource allocated but not > used). > It's not always easy to predict skew situations; when they happen, it makes sense > to fetch remote blocks to disk for shuffle-read rather than > kill the job because of OOM. This approach was mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
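The trade-off discussed above (keep fetched shuffle blocks in memory vs. divert them to disk) boils down to a buffered sink that holds data in memory until a byte threshold is crossed and then spills. A minimal single-process sketch of that idea; the class name and threshold semantics are illustrative only and mirror the role of a setting like the proposed {{spark.reducer.maxBytesShuffleToMemory}}, not Spark's actual implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative only: buffers a fetched block in memory while it stays under
// maxBytesInMemory, then spills everything written so far to a temp file and
// streams the rest to disk.
public class SpillableBlockSink implements AutoCloseable {
    private final long maxBytesInMemory;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream disk;   // non-null once we have spilled
    private File spillFile;
    private long bytesWritten;

    public SpillableBlockSink(long maxBytesInMemory) {
        this.maxBytesInMemory = maxBytesInMemory;
    }

    public void write(byte[] chunk) throws IOException {
        bytesWritten += chunk.length;
        if (disk == null && bytesWritten > maxBytesInMemory) {
            spillFile = File.createTempFile("shuffle-block", ".spill");
            disk = new FileOutputStream(spillFile);
            memory.writeTo(disk);   // move the buffered prefix to disk
            memory = null;          // free the in-memory buffer
        }
        if (disk != null) {
            disk.write(chunk);
        } else {
            memory.write(chunk);
        }
    }

    public boolean spilled() { return disk != null; }

    @Override
    public void close() throws IOException {
        if (disk != null) disk.close();
        if (spillFile != null) spillFile.delete();
    }

    public static void main(String[] args) throws IOException {
        try (SpillableBlockSink sink = new SpillableBlockSink(1024)) {
            sink.write(new byte[512]);
            System.out.println("spilled after 512B: " + sink.spilled());
            sink.write(new byte[1024]);
            System.out.println("spilled after 1536B: " + sink.spilled());
        }
    }
}
```

The open question in the comment, tracking at byte-buffer rather than block granularity, corresponds to making `write()` the unit of accounting instead of the whole block.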
[jira] [Commented] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
[ https://issues.apache.org/jira/browse/SPARK-20204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953930#comment-15953930 ] Apache Spark commented on SPARK-20204: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17521 > separate SQLConf into catalyst confs and sql confs > -- > > Key: SPARK-20204 > URL: https://issues.apache.org/jira/browse/SPARK-20204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
[ https://issues.apache.org/jira/browse/SPARK-20204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20204: Assignee: Apache Spark (was: Wenchen Fan) > separate SQLConf into catalyst confs and sql confs > -- > > Key: SPARK-20204 > URL: https://issues.apache.org/jira/browse/SPARK-20204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
[ https://issues.apache.org/jira/browse/SPARK-20204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20204: Assignee: Wenchen Fan (was: Apache Spark) > separate SQLConf into catalyst confs and sql confs > -- > > Key: SPARK-20204 > URL: https://issues.apache.org/jira/browse/SPARK-20204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
Wenchen Fan created SPARK-20204: --- Summary: separate SQLConf into catalyst confs and sql confs Key: SPARK-20204 URL: https://issues.apache.org/jira/browse/SPARK-20204 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953820#comment-15953820 ] Bryan Cutler commented on SPARK-19979: -- From the discussion in the PR:
{noformat}
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
val dt = new DecisionTreeClassifier()
  .setMaxDepth(5)
val pipeline = new Pipeline()
val pipeline1: Array[PipelineStage] = Array(tokenizer, hashingTF, lr)
val pipeline2: Array[PipelineStage] = Array(tokenizer, hashingTF, dt)
val pipeline1_grid = new ParamGridBuilder()
  .baseOn(pipeline.stages -> pipeline1)
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
val pipeline2_grid = new ParamGridBuilder()
  .baseOn(pipeline.stages -> pipeline2)
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .build()
val paramGrid = pipeline1_grid ++ pipeline2_grid
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice
{noformat}
[~josephkb] [~mlnick] would this be good to add to the documentation? > [MLLIB] Multiple Estimators/Pipelines In CrossValidator > --- > > Key: SPARK-19979 > URL: https://issues.apache.org/jira/browse/SPARK-19979 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: David Leifker > > Update CrossValidator and TrainValidationSplit to be able to accept multiple > pipelines and grid parameters for testing different algorithms and/or being > able to better control tuning combinations. Maintains a backwards-compatible > API and reads legacy serialized objects. > The same could be done using an external iterative approach. 
Build different > pipelines, throwing each into a CrossValidator, then taking the best > model from each of those CrossValidators, and finally picking the best from > those. This is the initial approach I explored. It resulted in a lot of > boilerplate code that felt like it shouldn't need to exist if the API simply > allowed for arrays of estimators and their parameters. > A couple of advantages of this implementation come from keeping the > functional interface of the CrossValidator: > 1. The caching of the folds is better utilized. An external iterative > approach creates a new set of k folds for each CrossValidator fit, and the > folds are discarded after each CrossValidator run. In this implementation a > single set of k folds is created and cached for all of the pipelines. > 2. A potential advantage of this implementation is future > parallelization of the pipelines within the CrossValidator. It is of course > possible to handle the parallelization outside of the CrossValidator here > too; however, I believe there is already work in progress to parallelize the > grid parameters, and that could be extended to multiple pipelines. > Both of these behind-the-scenes optimizations are possible because > the CrossValidator is provided with the data and the complete set of > pipelines/estimators to evaluate up front, allowing one to abstract away the > implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953793#comment-15953793 ] Apache Spark commented on SPARK-19712: -- User 'nsyca' has created a pull request for this issue: https://github.com/apache/spark/pull/17520 > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19712: Assignee: Apache Spark > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong >Assignee: Apache Spark > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19712: Assignee: (was: Apache Spark) > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953728#comment-15953728 ] DB Tsai commented on SPARK-20047: - I changed the target to 2.3.0. Thanks. > Constrained Logistic Regression > --- > > Key: SPARK-20047 > URL: https://issues.apache.org/jira/browse/SPARK-20047 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.2.0 >Reporter: DB Tsai >Assignee: Yanbo Liang > > For certain applications, such as stacked regressions, it is important to put > non-negative constraints on the regression coefficients. Also, if the ranges > of the coefficients are known, it makes sense to constrain the coefficient search > space. > Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ > R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors that place a > set of m linear constraints on the coefficients, is very challenging, as > discussed in much of the literature. > However, for box constraints on the coefficients, the optimization is well > solved. For gradient descent, one can do projected gradient descent in the > primal by zeroing the negative weights at each step. For LBFGS, an extended > version of it, LBFGS-B, can handle large-scale box optimization efficiently. > Unfortunately, for OWLQN, there is no efficient way to do optimization > with box constraints. > As a result, in this work, we only implement constrained LR with box > constraints and without L1 regularization. > Note that since we standardize the data in the training phase, the > coefficients seen in the optimization subroutine are in the scaled space; as > a result, we need to convert the box constraints into the scaled space. > Users will be able to set the lower / upper bounds of each coefficient and > intercept. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
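The "well solved" box-constrained case described in the issue amounts to clamping each coefficient back into its [lower, upper] interval after every gradient step. A toy 1-D illustration of projected gradient descent (illustrative numbers only, not Spark's LBFGS-B-based implementation):

```java
// Toy projected gradient descent for min_w (w - 3)^2 subject to 0 <= w <= 2.
// The unconstrained optimum is w = 3; projecting each step into the box
// makes the iterates converge to the boundary value w = 2 instead.
public class ProjectedGradientDemo {
    // Projection onto the box [lo, hi]: clamp the coordinate.
    static double clamp(double w, double lo, double hi) {
        return Math.max(lo, Math.min(hi, w));
    }

    public static void main(String[] args) {
        double w = 0.0, lr = 0.1, lo = 0.0, hi = 2.0;
        for (int i = 0; i < 200; i++) {
            double grad = 2 * (w - 3);          // d/dw of (w - 3)^2
            w = clamp(w - lr * grad, lo, hi);   // gradient step, then project
        }
        System.out.printf("w = %.4f%n", w);
    }
}
```

The same clamp, applied coordinate-wise, is exactly "zeroing the negative weights at each step" when the box is [0, +inf).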
[jira] [Updated] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-20047: Affects Version/s: (was: 2.1.0) 2.2.0 Target Version/s: 2.3.0 (was: 2.2.0) > Constrained Logistic Regression > --- > > Key: SPARK-20047 > URL: https://issues.apache.org/jira/browse/SPARK-20047 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.2.0 >Reporter: DB Tsai >Assignee: Yanbo Liang > > For certain applications, such as stacked regressions, it is important to put > non-negative constraints on the regression coefficients. Also, if the ranges > of the coefficients are known, it makes sense to constrain the coefficient search > space. > Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ > R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors that place a > set of m linear constraints on the coefficients, is very challenging, as > discussed in much of the literature. > However, for box constraints on the coefficients, the optimization is well > solved. For gradient descent, one can do projected gradient descent in the > primal by zeroing the negative weights at each step. For LBFGS, an extended > version of it, LBFGS-B, can handle large-scale box optimization efficiently. > Unfortunately, for OWLQN, there is no efficient way to do optimization > with box constraints. > As a result, in this work, we only implement constrained LR with box > constraints and without L1 regularization. > Note that since we standardize the data in the training phase, the > coefficients seen in the optimization subroutine are in the scaled space; as > a result, we need to convert the box constraints into the scaled space. > Users will be able to set the lower / upper bounds of each coefficient and > intercept. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20193) Selecting empty struct causes ExpressionEncoder error.
[ https://issues.apache.org/jira/browse/SPARK-20193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953704#comment-15953704 ] Adrian Ionescu commented on SPARK-20193: cc [~hvanhovell] > Selecting empty struct causes ExpressionEncoder error. > -- > > Key: SPARK-20193 > URL: https://issues.apache.org/jira/browse/SPARK-20193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu > Labels: struct > > {{def struct(cols: Column*): Column}} > Given the above signature and the lack of any note in the docs saying that a > struct with no columns is not supported, I would expect the following to work: > {{spark.range(3).select(col("id"), struct().as("empty_struct")).collect}} > However, this results in: > {quote} > java.lang.AssertionError: assertion failed: each serializer expression should > contains at least one `BoundReference` > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:240) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:238) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.(ExpressionEncoder.scala:238) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:63) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2837) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1131) > ... 
39 elided > {quote} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20194) Support partition pruning for InMemoryCatalog
[ https://issues.apache.org/jira/browse/SPARK-20194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20194. - Resolution: Fixed Assignee: Adrian Ionescu Fix Version/s: 2.2.0 > Support partition pruning for InMemoryCatalog > - > > Key: SPARK-20194 > URL: https://issues.apache.org/jira/browse/SPARK-20194 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu >Assignee: Adrian Ionescu > Fix For: 2.2.0 > > > {{listPartitionsByFilter()}} is not yet implemented for {{InMemoryCatalog}}: > {quote} > // TODO: Provide an implementation > throw new UnsupportedOperationException( > "listPartitionsByFilter is not implemented for InMemoryCatalog") > {quote} > Because of this, there is a hack in {{FindDataSourceTable}} that avoids > passing along the {{CatalogTable}} to the {{DataSource}} it creates when the > catalog implementation is not "hive", so that, when the latter is resolved, > an {{InMemoryFileIndex}} is created instead of a {{CatalogFileIndex}} which > the {{PruneFileSourcePartitions}} rule matches for. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arush Kharbanda updated SPARK-20199: Comment: was deleted (was: I will work on this issue.) > GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar >Priority: Minor > > Spark GradientBoostedTreesModel doesn't have a column sampling rate parameter. > This parameter is available in H2O and XGBoost. > Sample from H2O.ai: > gbmParams._col_sample_rate > Please provide the parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11783) When deployed against remote Hive metastore, HiveContext.executionHive points to wrong metastore
[ https://issues.apache.org/jira/browse/SPARK-11783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953647#comment-15953647 ] Jonathan Maron commented on SPARK-11783: I am running a spark job and, when instantiating a HiveContext, I see that the client creates a local derby-based metastore. Is this the intent for client processes? I don't understand the necessity for a client process to create a metastore instance rather than leverage the remote metastore server. > When deployed against remote Hive metastore, HiveContext.executionHive points > to wrong metastore > > > Key: SPARK-11783 > URL: https://issues.apache.org/jira/browse/SPARK-11783 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.6.0 > > > When using remote metastore, execution Hive client somehow is initialized to > point to the actual remote metastore instead of the dummy local Derby > metastore. > To reproduce this issue: > # Configuring {{conf/hive-site.xml}} to point to a remote Hive 1.2.1 > metastore. > # Set {{hive.metastore.uris}} to {{thrift://localhost:9083}}. > # Start metastore service using {{$HIVE_HOME/bin/hive --service metastore}} > # Start Thrift server with remote debugging options > # Attach the debugger to the Thrift server driver process, we can verify that > {{executionHive}} points to the remote metastore rather than the local > execution Derby metastore. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9272) Persist information of individual partitions when persisting partitioned data source tables to metastore
[ https://issues.apache.org/jira/browse/SPARK-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953582#comment-15953582 ] Daniel Tomes commented on SPARK-9272: - BUMP. This is an important issue. Let's get this resolved. > Persist information of individual partitions when persisting partitioned data > source tables to metastore > > > Key: SPARK-9272 > URL: https://issues.apache.org/jira/browse/SPARK-9272 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian > > Currently, when a partitioned data source table is persisted to the Hive > metastore, we only persist its partition columns. Information about > individual partitions is not persisted. This forces us to do a partition > discovery before reading a persisted partitioned table, which hurts > performance. > To fix this issue, we may persist partition information into the metastore. > Specifically, the format should be compatible with Hive to ensure > interoperability. > One approach to collecting partition values and partition directory paths > for dynamically partitioned tables is to use accumulators to collect the > expected information during the write job. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
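The accumulator idea in the last paragraph of the description: while writing a dynamically partitioned dataset, each task records the partition values it actually produced, and the driver persists that collected set to the metastore afterwards, so no post-hoc partition discovery is needed. A single-process sketch of the bookkeeping (a concurrent set plays the role of the accumulator; none of these names are Spark APIs):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListSet;

// Illustrative only: during a simulated "write job", each record's partition
// value is added to a shared set (the accumulator stand-in). After the job,
// the set holds exactly the partition directories to register.
public class PartitionRegistryDemo {
    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "2017-04-01/a", "2017-04-01/b", "2017-04-02/c", "2017-04-03/d");
        Set<String> seenPartitions = new ConcurrentSkipListSet<>();

        // Simulated write tasks: derive the partition value from each record.
        records.parallelStream().forEach(r ->
            seenPartitions.add("dt=" + r.substring(0, r.indexOf('/'))));

        // Driver side: this is the set that would be persisted to the metastore.
        System.out.println(new TreeSet<>(seenPartitions));
    }
}
```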
[jira] [Commented] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953551#comment-15953551 ] Nick Pentreath commented on SPARK-20047: Is this really targeted for 2.2.0? > Constrained Logistic Regression > --- > > Key: SPARK-20047 > URL: https://issues.apache.org/jira/browse/SPARK-20047 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.1.0 >Reporter: DB Tsai >Assignee: Yanbo Liang > > For certain applications, such as stacked regressions, it is important to put > non-negative constraints on the regression coefficients. Also, if the ranges > of the coefficients are known, it makes sense to constrain the coefficient search > space. > Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ > R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors that place a > set of m linear constraints on the coefficients, is very challenging, as > discussed extensively in the literature. > However, for box constraints on the coefficients, the optimization is well > solved. For gradient descent, one can apply projected gradient descent in the > primal by zeroing the negative weights at each step. For LBFGS, an extended > version of it, LBFGS-B, can handle large-scale box optimization efficiently. > Unfortunately, for OWLQN, there is no efficient way to do optimization > with box constraints. > As a result, in this work, we only implement constrained LR with box > constraints and without L1 regularization. > Note that since we standardize the data in the training phase, the > coefficients seen in the optimization subroutine are in the scaled space; as > a result, we need to convert the box constraints into the scaled space. > Users will be able to set the lower / upper bounds of each coefficient and > the intercepts.
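The primal scheme the description mentions can be sketched concretely: take a gradient step, then project the coefficient back into the box. Below is a toy pure-Python illustration on a 1-D least-squares problem (not the MLlib implementation, which the ticket proposes to build on LBFGS-B); note also the ticket's point that under standardization (x divided by its standard deviation σ) a bound l on the original coefficient becomes l·σ on the scaled one.

```python
def clip(value, lo, hi):
    """Project a scalar back into the box [lo, hi]."""
    return max(lo, min(hi, value))

def box_constrained_fit(xs, ys, lower, upper, lr=0.05, steps=2000):
    """1-D least squares for y ~ w * x, with w constrained to [lower, upper].

    Projected gradient descent in the primal: after every gradient step,
    the coefficient is clipped back into the box.
    """
    w, n = 0.0, len(xs)
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w = clip(w - lr * grad, lower, upper)
    return w

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]        # true coefficient is 2
print(box_constrained_fit(xs, ys, -10.0, 10.0))  # box inactive: converges near 2
print(box_constrained_fit(xs, ys, 0.0, 1.0))     # box active: pinned at the upper bound
```

When the box excludes the unconstrained optimum, the solution sits on the boundary, which is exactly the behavior a user setting per-coefficient bounds would expect.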
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953489#comment-15953489 ] Sean Owen commented on SPARK-20202: --- Alrighty, you can leave the status for now, but generally committers set Blocker. I'm not entirely clear this blocks a release, at least not yet. You're absolutely right, but the Hive fork, with binaries and source, is part of this project. At least, that's the idea. For example, it is notionally voted on and released with each Spark release, but the binary/source of this fork project isn't separately, explicitly voted on and released. I think that should occur, for the avoidance of doubt, so that this is a blessed artifact of the Spark project. Would this answer your process and policy concerns about the release? It's not pretty, but I think that's within the law. Of course, it's no answer in the long term. The goal is to not have to use the fork at all. If Hive packaging changes are already in place to make it unnecessary, great (is that all there is to it, everyone?). I don't know if that presents a solution for earlier versions of Hive. This fork may persist in existing branches, but it has to at least be released and used in a proper way. This may need fixes right now. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Blocker > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions.
[jira] [Updated] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated SPARK-20202: -- Priority: Blocker (was: Critical) It is against Apache policy to release binaries that aren't part of your project. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Blocker > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions.
[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953386#comment-15953386 ] Hyukjin Kwon commented on SPARK-19809: -- Shouldn't it contain a footer and schema information, or a magic number at least? I am not sure we can say a 0-byte file is an ORC file. > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2 >Reporter: Michał Dawid > > When reading from a Hive ORC table, if there are some 0-byte files we get a > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953341#comment-15953341 ] Michał Dawid commented on SPARK-19809: -- Those empty files have been created while processing with Pig scripts. {code}-rw-rw-rw- 3 etl hdfs 14103 2017-04-03 01:26 part-v001-o000-r-0_a_2 -rw-rw-rw- 3 etl hdfs 0 2017-04-03 01:26 part-v001-o000-r-0_a_3 -rw-rw-rw- 3 etl hdfs 10125 2017-04-03 01:27 part-v001-o000-r-0_a_4 {code} > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2 >Reporter: Michał Dawid
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953328#comment-15953328 ] Cyril de Vogelaere commented on SPARK-20203: Oh, I thought we were talking about the performance implication of adding an if that would be tested often. For the issue you just pointed out, I agree it would be a major negative consequence of that change. Sorry, I didn't understand that that was what you were talking about. Well, then I suppose we should resolve this thread as "Won't Fix", unless you think the potential user-friendliness outweighs that major drawback. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953319#comment-15953319 ] Sean Owen commented on SPARK-20203: --- How can this not have performance implications? You generate more frequent patterns, potentially a lot more. You can see this even in the comments and error messages about collecting too many elements to the driver. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953318#comment-15953318 ] Cyril de Vogelaere commented on SPARK-20203: I'm not splitting it; I deleted the other thread. I did agree that adding the zero special value might have a tiny negative effect on performance without adding new functionality, so I closed it, following that line of thought. This post is just about changing the default value, which, you agreed, can be discussed. That's a new context of discussion, so I created a new thread. This should make more sense, no? > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953315#comment-15953315 ] Owen O'Malley commented on SPARK-20202: --- I should also say here that the Hive community is willing to help. We are in the process of rolling releases so if Spark needs a change, we can work together to get this done. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Critical > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions.
[jira] [Comment Edited] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953299#comment-15953299 ] Cyril de Vogelaere edited comment on SPARK-20203 at 4/3/17 11:18 AM: - This cannot have performance implication, we are not changing anything but the default value. It does change the number of solution we are searching for. So of course it will take longer since the search space is bigger. But on a dataset where it already found everything, it should still do so. And not be slower at all. Now, it would just find everything by default. Which, I agree, should be debated. To know whether that's really what we want the default behavior of the program to be. was (Author: syrux): This cannot have performance implication, we are not changing anything but the default value. It does change the number of solution we are searching for. So of course it will take longer since the search space is bigger. But on a dataset where it already found everything, it should still do so. Now, it would just find everything by default. Which, I agree, should be debated. To know whether that's really what we want the default behavior of the program to be. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. 
> I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953299#comment-15953299 ] Cyril de Vogelaere commented on SPARK-20203: This cannot have performance implication, we are not changing anything but the default value. It does change the number of solution we are searching for. So of course it will take longer since the search space is bigger. But on a dataset where it already found everything, it should still do so. Now, it would just find everything by default. Which, I agree, should be debated. To know whether that's really what we want the default behavior of the program to be. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953298#comment-15953298 ] Owen O'Malley commented on SPARK-20202: --- As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either, the Spark project needs to use Hive's release artifacts or it formally fork Hive and move the fork into its git repository at Apache and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed. Hive is in the middle of rolling releases and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Critical > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953298#comment-15953298 ] Owen O'Malley edited comment on SPARK-20202 at 4/3/17 11:16 AM: As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either, the Spark project needs to use Hive's release artifacts or it needs to formally fork Hive and move the fork into its git repository at Apache and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed. Hive is in the middle of rolling releases and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions. was (Author: owen.omalley): As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either, the Spark project needs to use Hive's release artifacts or it formally fork Hive and move the fork into its git repository at Apache and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed. Hive is in the middle of rolling releases and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Critical > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. 
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953297#comment-15953297 ] Cyril de Vogelaere commented on SPARK-20203: SPARK-20180 was about adding a special value (0) to find all patterns no matter their length, and making it the default value. You pointed out that it might lower performance without adding more functionality, so I closed that thread. This one is just about changing the default value, with no other changes in the code. You said it needed discussion, since it was a change in default behavior. But the amount of comments on the last thread would discourage discussion, so I felt a new thread would be more appropriate. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
[jira] [Closed] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere closed SPARK-20180. -- Resolution: Won't Fix > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use the .setMaxPatternLength() method to > specify the maximum pattern length of a sequence. Any pattern longer than > that won't be output. > The current default maxPatternLength value is 10. > This should be changed so that with input 0, all patterns of any length would > be output. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in a dataset without looking at this > parameter.
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953289#comment-15953289 ] Sean Owen commented on SPARK-20203: --- This is again not addressing the point, that doing so has performance implications. Or could. That has to be established. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly. At least for new users. > Personally, when I run an algorithm, I expect it to find all solution by > default. And a limited number of them, when I set the parameters to do so. > The current implementation limit the length of solution patterns to 10. > Thus preventing all solution to be printed when running slightly large > datasets. > I feel like that should be changed, but since this would change the default > behavior of PrefixSpan. I think asking for the communities opinion should > come first. So, what do you think ?
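The performance concern raised above can be made concrete: the number of frequent patterns (and the work to enumerate them) grows as the length cap is raised, so Int.MaxValue as a default removes the only brake on that growth. A toy brute-force sketch in plain Python, only to illustrate the effect of the cap; this is not the PrefixSpan algorithm itself:

```python
from itertools import combinations

def frequent_patterns(sequences, min_support, max_pattern_length):
    """Brute-force frequent-subsequence mining with a length cap.

    A toy stand-in for PrefixSpan, used only to show how the result set
    grows when max_pattern_length is raised.
    """
    # Candidate patterns: every subsequence of every input sequence,
    # truncated at the length cap.
    candidates = set()
    for seq in sequences:
        for n in range(1, min(max_pattern_length, len(seq)) + 1):
            candidates.update(combinations(seq, n))

    def is_subsequence(pattern, seq):
        it = iter(seq)
        return all(item in it for item in pattern)  # consumes `it` in order

    return sorted(p for p in candidates
                  if sum(is_subsequence(p, s) for s in sequences) >= min_support)

db = [(1, 2, 3), (1, 2, 3), (1, 3, 2)]
print(len(frequent_patterns(db, 3, 1)))  # with the cap at 1
print(len(frequent_patterns(db, 3, 3)))  # raising the cap yields more patterns
```

Even on this three-sequence toy database the frequent-pattern count grows as the cap is lifted; on real data the growth can be combinatorial, which is the cost of collecting "all solutions" by default.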
[jira] [Updated] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cyril de Vogelaere updated SPARK-20203:
---------------------------------------
    Description:
I think changing the default value to Int.MaxValue would be more user friendly. At least for new users. Personally, when I run an algorithm, I expect it to find all solution by default. And a limited number of them, when I set the parameters to do so.
The current implementation limit the length of solution patterns to 10. Thus preventing all solution to be printed when running slightly large datasets.
I feel like that should be changed, but since this would change the default behavior of PrefixSpan. I think asking for the communities opinion should come first. So, what do you think ?

  was:
I think changing the default value to Int.MaxValue would be more user friendly. At least for new user. Personally, when I run an algorithm, I expect it to find all solution by default. And a limited number of them, when I set the parameters so.
The current implementation limit the length of solution patterns to 10. Thus preventing all solution to be printed when running slightly large datasets.
[jira] [Commented] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953282#comment-15953282 ]

Cyril de Vogelaere commented on SPARK-20180:
--------------------------------------------

Fine, I thought a TODO left in the code would reflect the wish of the community, at least a little. I will close this thread and open a new one on changing the default value to Int.MaxValue, since I personally think it would be friendlier to new users.

Link to new thread: https://issues.apache.org/jira/browse/SPARK-20203

Tomorrow, I will create a new thread with another improvement I want to add to Spark. I need to run a performance test on just that change first, to prove it will be useful. I hope you will follow it too.
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953280#comment-15953280 ]

Sean Owen commented on SPARK-20203:
-----------------------------------

I don't understand; isn't this the same as SPARK-20180?
[jira] [Created] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
Cyril de Vogelaere created SPARK-20203:
---------------------------------------

             Summary: Change default maxPatternLength value to Int.MaxValue in PrefixSpan
                 Key: SPARK-20203
                 URL: https://issues.apache.org/jira/browse/SPARK-20203
             Project: Spark
          Issue Type: Wish
          Components: MLlib
    Affects Versions: 2.1.0
            Reporter: Cyril de Vogelaere
            Priority: Trivial

I think changing the default value to Int.MaxValue would be more user friendly, at least for new users. Personally, when I run an algorithm, I expect it to find all solutions by default, and a limited number of them when I set the parameters to do so.
The current implementation limits the length of solution patterns to 10, preventing all solutions from being printed when running slightly large datasets.
[jira] [Updated] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-20202:
------------------------------
         Priority: Critical  (was: Blocker)
    Fix Version/s:     (was: 2.1.1)
                       (was: 1.6.4)
                       (was: 2.0.3)

I see wide agreement on that. One question I have: is including Hive this way merely really-not-nice-to-have, or actually not allowed? I think the question is whether sources are available, right? Because releases can't have binary-only parts. I plead ignorance; I have never myself paid much attention to this integration. If it's not allowed, then something has to change for releases beyond 2.1.1, and this can be targeted as a Blocker accordingly.

Does this depend on refactoring or changes in Hive? IIRC the problem was hive-exec being an uber-jar, but it's been a long time since I read any of that discussion.

> Remove references to org.spark-project.hive
> -------------------------------------------
>
>                 Key: SPARK-20202
>                 URL: https://issues.apache.org/jira/browse/SPARK-20202
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, SQL
>    Affects Versions: 1.6.4, 2.0.3, 2.1.1
>            Reporter: Owen O'Malley
>            Priority: Critical
>
> Spark can't continue to depend on their fork of Hive and must move to standard Hive versions.
[jira] [Commented] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953240#comment-15953240 ]

Sean Owen commented on SPARK-20180:
-----------------------------------

Surely the impact is more than an 'if' statement. If you contemplate much larger spans, that's going to take longer to compute and return, right? I think we're not at all in agreement there, especially as you're seeing the test (?) run forever. Yes, I know there's a TODO (BTW, you can see who wrote it with 'blame'), but that doesn't mean I agree with it. It also doesn't say it should be a default.

Keep in mind how much time it takes to discuss these changes relative to the value. We need to converge rapidly to decisions. The question here is performance impact on non-trivial examples. So far I just don't see a compelling reason to change a default. The functionality you want is already available.
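The cap being debated above can be illustrated without Spark. The following is a minimal, self-contained sketch in plain Scala — not Spark's PrefixSpan implementation; the object name, `frequentPatterns`, and the "0 means unlimited" convention are illustrative assumptions taken from this ticket. It counts frequent sequential patterns up to a configurable `maxPatternLength`:

```scala
// Naive frequent-sequential-pattern counter, for illustrating the
// maxPatternLength semantics only (exponential candidate generation;
// the real PrefixSpan is far more efficient).
object PrefixSpanSketch {
  type Seqn = List[Int]

  // True if `pattern` occurs as a (not necessarily contiguous) subsequence of `s`.
  def isSubseq(pattern: Seqn, s: Seqn): Boolean = pattern match {
    case Nil => true
    case p :: ps =>
      s.dropWhile(_ != p) match {
        case Nil       => false
        case _ :: rest => isSubseq(ps, rest)
      }
  }

  // maxPatternLength == 0 is treated as "unlimited", the special value
  // proposed in this ticket; any positive value caps the pattern length.
  def frequentPatterns(db: List[Seqn], minCount: Int, maxPatternLength: Int): Map[Seqn, Int] = {
    require(minCount >= 1, "minCount must be positive so the search terminates")
    val unlimited = maxPatternLength == 0
    val items = db.flatten.distinct
    var level: List[Seqn] = items.map(List(_)) // candidates of the current length
    var result = Map.empty[Seqn, Int]
    var len = 1
    while (level.nonEmpty && (unlimited || len <= maxPatternLength)) {
      val counted = level.map(p => p -> db.count(s => isSubseq(p, s))).filter(_._2 >= minCount)
      result ++= counted
      // Extend only frequent patterns (support is anti-monotone), so the
      // unlimited case still terminates once no candidate reaches minCount.
      level = counted.map(_._1).flatMap(p => items.map(i => p :+ i))
      len += 1
    }
    result
  }
}
```

In Spark itself, the behavior under discussion is already reachable today by setting the existing knob explicitly, e.g. `setMaxPatternLength(Int.MaxValue)`; the ticket only argues about the default.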
[jira] [Created] (SPARK-20202) Remove references to org.spark-project.hive
Owen O'Malley created SPARK-20202:
----------------------------------

             Summary: Remove references to org.spark-project.hive
                 Key: SPARK-20202
                 URL: https://issues.apache.org/jira/browse/SPARK-20202
             Project: Spark
          Issue Type: Bug
          Components: Build, SQL
    Affects Versions: 1.6.4, 2.0.3, 2.1.1
            Reporter: Owen O'Malley
            Priority: Blocker
             Fix For: 1.6.4, 2.0.3, 2.1.1

Spark can't continue to depend on their fork of Hive and must move to standard Hive versions.
[jira] [Resolved] (SPARK-19752) OrcGetSplits fails with 0 size files
[ https://issues.apache.org/jira/browse/SPARK-19752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-19752.
----------------------------------
    Resolution: Duplicate

It sounds like a duplicate of SPARK-19809. Please reopen it if I misunderstood.

> OrcGetSplits fails with 0 size files
> ------------------------------------
>
>                 Key: SPARK-19752
>                 URL: https://issues.apache.org/jira/browse/SPARK-19752
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.1.0
>            Reporter: Nick Orka
>
> There is a possibility that during some SQL queries a partition may have a 0 size (empty) file. Next time I try to read from the file with a SQL query, I get this error:
> 17/02/27 10:33:11 INFO PerfLogger: start=1488191591570 end=1488191591599 duration=29 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl
> 17/02/27 10:33:11 ERROR ApplicationMaster: User class threw exception: java.lang.reflect.InvocationTargetException
> java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at scala.reflect.runtime.JavaMirrors$JavaMirror$JavaVanillaMethodMirror1.jinvokeraw(JavaMirrors.scala:373)
>         at scala.reflect.runtime.JavaMirrors$JavaMirror$JavaMethodMirror.jinvoke(JavaMirrors.scala:339)
>         at scala.reflect.runtime.JavaMirrors$JavaMirror$JavaVanillaMethodMirror.apply(JavaMirrors.scala:355)
>         at com.sessionm.Datapipeline$.main(Datapipeline.scala:200)
>         at com.sessionm.Datapipeline.main(Datapipeline.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
> Caused by: java.lang.RuntimeException: serious problem
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>         at scala.Option.getOrElse(Option.scala:121)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>         at scala.Option.getOrElse(Option.scala:121)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>         at scala.Option.getOrElse(Option.scala:121)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
>         at scala.collection.parallel.AugmentedIterableIterator$class.map2combiner(RemainsIterator.scala:115)
>         at scala.collection.parallel.immutable.ParVector$ParVectorIterator.map2combiner(ParVector.scala:62)
>         at scala.collection.parallel.ParIterableLike$Map.leaf(ParIterableLike.scala:1054)
>         at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
>         at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
>         at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
>         at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
>         at scala.collection.parallel.ParIterableLike$Map.tryLeaf(ParIterableLike.scala:1051)
>         at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:169)
>         at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443)
>         at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149)
>         at
[jira] [Resolved] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-19809.
----------------------------------
    Resolution: Invalid

I don't think a 0-byte ORC file can be valid; it should at least have the footer. Moreover, Spark's ORC datasource currently does not write out empty files (see https://issues.apache.org/jira/browse/SPARK-15474). Please reopen this if I misunderstood. It would be great if there were some steps to reproduce, to verify this issue. I am resolving this.

> NullPointerException on empty ORC file
> --------------------------------------
>
>                 Key: SPARK-19809
>                 URL: https://issues.apache.org/jira/browse/SPARK-19809
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.6.3, 2.0.2
>            Reporter: Michał Dawid
>
> When reading from a Hive ORC table, if there are some 0 byte files we get a NullPointerException:
> {code}
> java.lang.NullPointerException
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.immutable.List.foreach(List.scala:318)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>         at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>         at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>         at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>         at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>         at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>         at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>         at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>         at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>         at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>         at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>         at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>         at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>         at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>         at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>
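Until the underlying split-generation issue is addressed, one common workaround for the zero-byte-file failures in the two ORC tickets above is to hand the reader an explicit list of non-empty files instead of a whole directory. A hedged, self-contained sketch (plain JVM file APIs; `NonEmptyFiles` is an illustrative helper name, not part of Spark or Hive):

```scala
import java.nio.file.{Files, Path, Paths}

// Workaround sketch, not a fix in Spark itself: collect only the non-empty
// regular files under a directory, so zero-byte files never reach
// OrcInputFormat's split generation.
object NonEmptyFiles {
  def list(dir: String): List[Path] = {
    val stream = Files.list(Paths.get(dir)) // must be closed when done
    try {
      val it = stream.iterator()
      var acc = List.empty[Path]
      while (it.hasNext) {
        val p = it.next()
        if (Files.isRegularFile(p) && Files.size(p) > 0) acc = p :: acc
      }
      acc.reverse
    } finally stream.close()
  }
}
```

The surviving paths can then be passed to the reader individually (e.g. joined into an explicit path list) rather than reading the partition directory as a whole. On HDFS the same filtering would use the Hadoop `FileSystem` API instead of `java.nio.file`.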
[jira] [Comment Edited] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953201#comment-15953201 ]

Cyril de Vogelaere edited comment on SPARK-20180 at 4/3/17 9:57 AM:
--------------------------------------------------------------------

{quote}Why not let the default be Int.MaxValue?{quote}
=> I'm also ok with a default of Int.MaxValue, if the special value zero is really something you are against.

{quote}If that's what this is about, update the title to reflect it.{quote}
=> I will gladly do that, if you think the current title is misleading.

{quote}This is a behavior change by default, so we should think carefully about it.{quote}
=> Yes, I agree.

{quote}What are the downsides – why would someone have ever made it 10? Presumably, performance.{quote}
=> The changed code consists simply of an additional condition in an if. If you want to see a graph, I have one that tests the differences in performance, but on my implementation optimised for single-item patterns, so it wouldn't be relevant. If you are worried about a performance drop, I can do additional tests on the two lines I changed. If you want me to use some particular dataset, I will also gladly oblige; just say the word, and you will have them by tomorrow.
So it would be less about what impact it has on performance, since it would be negligible (again, I'm ready to prove that if you want me to), and more about whether that feature seems needed or not. Which, I agree, is debatable.
Also, whichever senior implemented it that way left this comment in the original PrefixSpan code:
{code}// TODO: support unbounded pattern length when maxPatternLength = 0{code}
which is the reason I created this JIRA thread first, among the list of improvements I want to propose. If these changes are rejected, then when I have the occasion I will remove that line, since this thread would have established that it isn't needed.

{quote}You mention tests don't end and haven't established it's not due to your change.{quote}
=> I'm establishing that right now, as I said. Also, they are ending, but they are really, really slow.

{quote}I don't think we can proceed with this in this state, right?{quote}
=> I will leave the decision to you.


  was (Author: syrux):
=> Why not let the default be Int.MaxValue? I'm also ok with a default Int.MaxValue, if the special value zero is really something you are against. if that's what this is about, update the title to reflect it. => I will gladly do that, if you think the current title is misleading. This is a behavior change by default, so we should think carefully about it => Yes, I agree. What are the downsides – why would someone have ever made it 10? presumably, performance. => The changed code consist simply in an additionnal condition in an if. If you want to see a graph, I have one that test the differences in performances, but on my implementation optimised for single-item pattern. So it wouldn't be relevant, if you are worried of performance drop, I can do additional tests, on the two lines I changed. If you want me to use some particular dataset, I will also gladly oblige. Just say the word, and you will have them by tommorow. So it would be less about what impact it has on the performance, since it would be negligeable (again, i'm ready to prove that if you want me to), but about whether that feature seems needed or not. Which I agree, is debatable. Also, whichever senior implemented it that way, left this comment : // TODO: support unbounded pattern length when maxPatternLength = 0 Which you can find in the original code, and is the reason I created this Jira's thread first. Among the list of improvement I want to propose. You can find that line in the PrefixSpan code if you don't believe me. If theses change are rejected, then when I have the occasion, I will remove that line. So it would establish it isn't needed. You mention tests don't end and haven't established it's not due to your change. => I'm establishing that right now ... as I said. Also, they are ending, but they are really really slow. I don't think we can proceed with this in this state, right? => I will leave the decision to you
[jira] [Assigned] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-19641:
-----------------------------------
    Assignee: Hyukjin Kwon

> JSON schema inference in DROPMALFORMED mode produces incorrect schema
> ---------------------------------------------------------------------
>
>                 Key: SPARK-19641
>                 URL: https://issues.apache.org/jira/browse/SPARK-19641
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Nathan Howell
>            Assignee: Hyukjin Kwon
>             Fix For: 2.2.0
>
> In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no columns. This occurs when one document contains a valid JSON value (such as a string or number) and the other documents contain objects or arrays.
> When the default case in {{JsonInferSchema.compatibleRootType}} is reached when merging a {{StringType}} and a {{StructType}}, the resulting type will be a {{StringType}}, which is then discarded because a {{StructType}} is expected.
[jira] [Resolved] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-19641.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 17492
[https://github.com/apache/spark/pull/17492]
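The failure mode fixed in SPARK-19641 can be sketched without Spark. The following is a simplified model in plain Scala — `SimpleType`, `RootMerge`, and friends are illustrative assumptions, not Spark's actual `JsonInferSchema` code. The default merge case widens a struct/string conflict to a string root, which a caller expecting a struct then discards, yielding the empty (zero-column) schema described in the ticket:

```scala
// Simplified model of JSON root-type reconciliation; it mirrors the
// pre-fix behavior described in the ticket, not Spark's real code.
sealed trait SimpleType
case object SimpleString extends SimpleType
final case class SimpleStruct(fields: Set[String]) extends SimpleType

object RootMerge {
  // Default case: incompatible root types widen to a string type.
  def merge(a: SimpleType, b: SimpleType): SimpleType = (a, b) match {
    case (SimpleStruct(f1), SimpleStruct(f2)) => SimpleStruct(f1 ++ f2)
    case _                                    => SimpleString
  }

  // The inferred schema keeps only struct roots, so a string result is
  // dropped entirely: one scalar document wipes out all inferred columns.
  def inferColumns(docs: Seq[SimpleType]): Set[String] =
    docs.reduce(merge) match {
      case SimpleStruct(fields) => fields
      case _                    => Set.empty
    }
}
```

In this model, mixing one scalar document into a collection of objects collapses the inferred column set to nothing, which is the symptom the ticket reports for {{DROPMALFORMED}} mode.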
[jira] [Commented] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953201#comment-15953201 ] Cyril de Vogelaere commented on SPARK-20180: Why not let the default be Int.MaxValue? => I'm also OK with a default of Int.MaxValue, if the special value zero is really something you are against. if that's what this is about, update the title to reflect it. => I will gladly do that, if you think the current title is misleading. This is a behavior change by default, so we should think carefully about it. => Yes, I agree. What are the downsides -- why would someone have ever made it 10? presumably, performance. => The changed code consists simply of an additional condition in an if statement. If you want to see a graph, I have one that tests the difference in performance, but it was run on my implementation optimised for single-item patterns, so it wouldn't be relevant here. If you are worried about a performance drop, I can run additional tests on the two lines I changed; if you want me to use a particular dataset, I will gladly oblige. Just say the word, and you will have the results by tomorrow. So this is less about the performance impact, which would be negligible (again, I'm ready to prove that if you want me to), than about whether the feature seems needed or not, which, I agree, is debatable. Also, whoever originally implemented it this way left the comment // TODO: support unbounded pattern length when maxPatternLength = 0 in the PrefixSpan code, which is the reason I created this JIRA in the first place, among the list of improvements I want to propose. If these changes are rejected, then when I have the occasion I will remove that line, to establish that it isn't needed. You mention tests don't end and haven't established it's not due to your change. => I'm establishing that right now, as I said.
I don't think we can proceed with this in this state, right? => I will leave the decision to you > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use .setMaxPatternLength() method to > specify is the maximum pattern length of a sequence. Any pattern longer than > that won't be outputted. > The current default maxPatternlength value being 10. > This should be changed so that with input 0, all pattern of any length would > be outputted. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in his dataset without looking at this > parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
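The sentinel-value convention debated in this thread (0 meaning "unlimited") can be implemented with a single extra condition, by mapping the sentinel to Int.MaxValue internally so downstream length checks need no special case. This is an illustrative Java sketch, not the actual PrefixSpan code; the method name is hypothetical.

```java
public class MaxPatternLength {
    // Hypothetical helper: interpret the proposed sentinel 0 as "unlimited"
    // by mapping it to Integer.MAX_VALUE; explicit bounds pass through.
    static int effectiveMaxPatternLength(int maxPatternLength) {
        if (maxPatternLength < 0) {
            throw new IllegalArgumentException("maxPatternLength must be >= 0");
        }
        return maxPatternLength == 0 ? Integer.MAX_VALUE : maxPatternLength;
    }

    public static void main(String[] args) {
        System.out.println(effectiveMaxPatternLength(10)); // explicit bound is kept
        System.out.println(effectiveMaxPatternLength(0));  // sentinel: unlimited
    }
}
```

The alternative raised by Sean Owen, simply defaulting the parameter to Int.MaxValue, avoids the sentinel entirely at the cost of a less discoverable "unlimited" setting.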
[jira] [Assigned] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19969: -- Assignee: yuhao yang > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Assignee: yuhao yang > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19969. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17324 [https://github.com/apache/spark/pull/17324] > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API
[ https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953199#comment-15953199 ] Hyukjin Kwon commented on SPARK-20090: -- [~josephkb], gentle ping. > Add StructType.fieldNames to Python API > --- > > Key: SPARK-20090 > URL: https://issues.apache.org/jira/browse/SPARK-20090 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Joseph K. Bradley >Priority: Trivial > > The Scala/Java API for {{StructType}} has a method {{fieldNames}}. It would > be nice if the Python {{StructType}} did as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20108) Spark query is getting failed with exception
[ https://issues.apache.org/jira/browse/SPARK-20108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953196#comment-15953196 ] Hyukjin Kwon commented on SPARK-20108: -- It will help other guys like me to track down the problem and solve this. > Spark query is getting failed with exception > > > Key: SPARK-20108 > URL: https://issues.apache.org/jira/browse/SPARK-20108 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: ZS EDGE > > In our project we have implemented a logic where we programatically generate > spark queries. These queries are executed as a sub query and below is the > sample query-- > sqlContext.sql("INSERT INTO TABLE > test_client_r2_r2_2_prod_db1_oz.S3_EMPDTL_Incremental_invalid SELECT > 'S3_EMPDTL_Incremental',S3_EMPDTL_Incremental.row_id,S3_EMPDTL_Incremental.SOURCE_FILE_NAME,S3_EMPDTL_Incremental.SOURCE_ROW_ID,'S3_EMPDTL_Incremental','2017-03-22 > > 20:18:59','1','Emp_id#$Emp_name#$Emp_phone#$Emp_salary_in_K#$Emp_address_id#$Date_of_Birth#$Status#$Dept_id#$Date_of_joining#$Row_Number#$Dec_check#$','test','Y','N/A','','' > FROM S3_EMPDTL_Incremental_r AS S3_EMPDTL_Incremental where row_id IN > (select row_id from s3_empdtl_incremental_r where row_id IN(42949672960))") > While executing the above code in the pyspark it is throwing below exception-- > FAILS>> > .spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463) > at > org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > [Stage 32:=> (10 + 5) / > 26]17/03/22 15:42:10 ERROR TaskSetManager: Task 4 in stage 32.0 > failed 4 times; aborting job > 17/03/22 15:42:10 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in > stage 32.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 32.0 (TID 857, ip-10-116-1-73.ec2.internal): > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at
[jira] [Commented] (SPARK-20108) Spark query is getting failed with exception
[ https://issues.apache.org/jira/browse/SPARK-20108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953195#comment-15953195 ] Hyukjin Kwon commented on SPARK-20108: -- It seems almost impossible to reproduce to me. Do you mind if I ask a self-reproducer? > Spark query is getting failed with exception > > > Key: SPARK-20108 > URL: https://issues.apache.org/jira/browse/SPARK-20108 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: ZS EDGE > > In our project we have implemented a logic where we programatically generate > spark queries. These queries are executed as a sub query and below is the > sample query-- > sqlContext.sql("INSERT INTO TABLE > test_client_r2_r2_2_prod_db1_oz.S3_EMPDTL_Incremental_invalid SELECT > 'S3_EMPDTL_Incremental',S3_EMPDTL_Incremental.row_id,S3_EMPDTL_Incremental.SOURCE_FILE_NAME,S3_EMPDTL_Incremental.SOURCE_ROW_ID,'S3_EMPDTL_Incremental','2017-03-22 > > 20:18:59','1','Emp_id#$Emp_name#$Emp_phone#$Emp_salary_in_K#$Emp_address_id#$Date_of_Birth#$Status#$Dept_id#$Date_of_joining#$Row_Number#$Dec_check#$','test','Y','N/A','','' > FROM S3_EMPDTL_Incremental_r AS S3_EMPDTL_Incremental where row_id IN > (select row_id from s3_empdtl_incremental_r where row_id IN(42949672960))") > While executing the above code in the pyspark it is throwing below exception-- > FAILS>> > .spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463) > at > org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > [Stage 32:=> (10 + 5) / > 26]17/03/22 15:42:10 ERROR TaskSetManager: Task 4 in stage 32.0 > failed 4 times; aborting job > 17/03/22 15:42:10 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in > stage 32.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 32.0 (TID 857, ip-10-116-1-73.ec2.internal): > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at
[jira] [Updated] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20180: --- Summary: Add a special value for unlimited max pattern length in Prefix span, and set it as default. (was: Unlimited max pattern length in Prefix span) > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use .setMaxPatternLength() method to > specify is the maximum pattern length of a sequence. Any pattern longer than > that won't be outputted. > The current default maxPatternlength value being 10. > This should be changed so that with input 0, all pattern of any length would > be outputted. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in his dataset without looking at this > parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20185) csv decompressed incorrectly with extension other than 'gz'
[ https://issues.apache.org/jira/browse/SPARK-20185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953183#comment-15953183 ] Hyukjin Kwon edited comment on SPARK-20185 at 4/3/17 9:28 AM: -- {{codec}} or {{compression}} is an option for writing out as documented. It seems the workaround is not so difficult and the Hadoop's behaviour looks sensible to me as well. was (Author: hyukjin.kwon): {{codec}} or {{compression}} is an option for writing out as documented. It seems the workaround is not so difficult and the behaviour looks reasonable to me as well. > csv decompressed incorrectly with extention other than 'gz' > --- > > Key: SPARK-20185 > URL: https://issues.apache.org/jira/browse/SPARK-20185 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Ran Mingxuan >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > With code below: > val start_time = System.currentTimeMillis() > val gzFile = spark.read > .format("com.databricks.spark.csv") > .option("header", "false") > .option("inferSchema", "false") > .option("codec", "gzip") > .load("/foo/someCsvFile.gz.bak") > gzFile.repartition(1).write.mode("overwrite").parquet("/foo/") > got error even if I indicated the codec: > WARN util.NativeCodeLoader: Unable to load native-hadoop library for your > platform... using builtin-java classes where applicable > 17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to > the server : > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby. 
Visit > https://s.apache.org/sbnn-error > 17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage > 12.0 (TID 977) > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166) > Have to add extension to GzipCodec to make my code run. > import org.apache.hadoop.io.compress.GzipCodec > class BakGzipCodec extends GzipCodec { > override def getDefaultExtension(): String = ".gz.bak" > } > I suppose the file loader should get file codec depending on option first, > and then to extension. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20185) csv decompressed incorrectly with extension other than 'gz'
[ https://issues.apache.org/jira/browse/SPARK-20185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953183#comment-15953183 ] Hyukjin Kwon commented on SPARK-20185: -- {{codec}} or {{compression}} is an option for writing out as documented. It seems the workaround is not so difficult and the behaviour looks reasonable to me as well. > csv decompressed incorrectly with extention other than 'gz' > --- > > Key: SPARK-20185 > URL: https://issues.apache.org/jira/browse/SPARK-20185 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Ran Mingxuan >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > With code below: > val start_time = System.currentTimeMillis() > val gzFile = spark.read > .format("com.databricks.spark.csv") > .option("header", "false") > .option("inferSchema", "false") > .option("codec", "gzip") > .load("/foo/someCsvFile.gz.bak") > gzFile.repartition(1).write.mode("overwrite").parquet("/foo/") > got error even if I indicated the codec: > WARN util.NativeCodeLoader: Unable to load native-hadoop library for your > platform... using builtin-java classes where applicable > 17/03/23 15:44:55 WARN ipc.Client: Exception encountered while connecting to > the server : > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby. 
Visit > https://s.apache.org/sbnn-error > 17/03/23 15:44:58 ERROR executor.Executor: Exception in task 2.0 in stage > 12.0 (TID 977) > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:109) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166) > Have to add extension to GzipCodec to make my code run. > import org.apache.hadoop.io.compress.GzipCodec > class BakGzipCodec extends GzipCodec { > override def getDefaultExtension(): String = ".gz.bak" > } > I suppose the file loader should get file codec depending on option first, > and then to extension. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
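The root cause in SPARK-20185 is that codec selection for reading is driven by the file suffix, not by the {{codec}} option (which only applies to writing). A minimal sketch of suffix-based lookup, mirroring the way Hadoop's CompressionCodecFactory resolves a codec from the extension; the class, map contents, and method name here are illustrative, not Hadoop API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CodecLookup {
    // Known suffix -> codec table (illustrative subset).
    static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".gz", "GzipCodec");
        CODEC_BY_SUFFIX.put(".bz2", "BZip2Codec");
    }

    // Resolve a codec purely from the file name's trailing suffix.
    static String resolveCodec(String path) {
        for (Map.Entry<String, String> e : CODEC_BY_SUFFIX.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null; // no match: the file is read as-is, i.e. never decompressed
    }

    public static void main(String[] args) {
        System.out.println(resolveCodec("someCsvFile.gz"));     // GzipCodec
        System.out.println(resolveCodec("someCsvFile.gz.bak")); // null
    }
}
```

This is why the reporter's `.gz.bak` file is read as plain text, and why registering a custom codec whose default extension is `.gz.bak` works around it.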
[jira] [Resolved] (SPARK-9002) KryoSerializer initialization does not include 'Array[Int]'
[ https://issues.apache.org/jira/browse/SPARK-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9002. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17482 [https://github.com/apache/spark/pull/17482] > KryoSerializer initialization does not include 'Array[Int]' > --- > > Key: SPARK-9002 > URL: https://issues.apache.org/jira/browse/SPARK-9002 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 > Environment: MacBook Pro, OS X 10.10.4, Spark 1.4.0, master=local[*], > IntelliJ IDEA. >Reporter: Randy Kerber >Priority: Minor > Labels: easyfix, newbie > Fix For: 2.2.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > The object KryoSerializer (inside KryoRegistrator.scala) contains a list of > classes that are automatically registered with Kryo. That list includes: > Array\[Byte], Array\[Long], and Array\[Short]. Array\[Int] is missing from > that list. Can't think of any good reason it shouldn't also be included. > Note: This is first time creating an issue or contributing code to an apache > project. Apologies if I'm not following the process correct. Appreciate any > guidance or assistance. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20180) Unlimited max pattern length in Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953166#comment-15953166 ] Sean Owen commented on SPARK-20180: --- Why not let the default be Int.MaxValue? if that's what this is about, update the title to reflect it. This is a behavior change by default, so we should think carefully about it. What are the downsides -- why would someone have ever made it 10? presumably, performance. I don't see you've benchmarked the impact of making this unlimited by default. You mention tests don't end and haven't established it's not due to your change. I don't think we can proceed with this in this state, right? > Unlimited max pattern length in Prefix span > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use .setMaxPatternLength() method to > specify is the maximum pattern length of a sequence. Any pattern longer than > that won't be outputted. > The current default maxPatternlength value being 10. > This should be changed so that with input 0, all pattern of any length would > be outputted. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in his dataset without looking at this > parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20166) Use XXX for ISO timezone instead of ZZ which is FastDateFormat specific in CSV/JSON time related options
[ https://issues.apache.org/jira/browse/SPARK-20166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20166: - Assignee: Hyukjin Kwon Priority: Minor (was: Trivial) > Use XXX for ISO timezone instead of ZZ which is FastDateFormat specific in > CSV/JSON time related options > > > Key: SPARK-20166 > URL: https://issues.apache.org/jira/browse/SPARK-20166 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.2.0 > > > We can use {{XXX}} format instead of {{ZZ}}. {{ZZ}} seems a > {{FastDateFormat}} specific Please see > https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#iso8601timezone > and > https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/FastDateFormat.html > {{ZZ}} supports "ISO 8601 extended format time zones" but it seems > {{FastDateFormat}} specific option. > It seems we better replace {{ZZ}} to {{XXX}} because they look use the same > strategy - > https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L930. > > I also checked the codes and manually debugged it for sure. It seems both > cases use the same pattern {code}( Z|(?:[+-]\\d{2}(?::)\\d{2})) {code}. > Note that this is a fix about documentation not the behaviour change because > {{ZZ}} seems invalid date format in {{SimpleDateFormat}} as documented in > {{DataFrameReader}}: > {quote} >* `timestampFormat` (default `-MM-dd'T'HH:mm:ss.SSSZZ`): sets the > string that >* indicates a timestamp format. Custom date formats follow the formats at >* `java.text.SimpleDateFormat`. This applies to timestamp type. 
> {quote} > {code} > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00") > res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z") > res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00") > java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00" > at java.text.DateFormat.parse(DateFormat.java:366) > ... 48 elided > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z") > java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z" > at java.text.DateFormat.parse(DateFormat.java:366) > ... 48 elided > {code} > {code} > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00") > res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z") > res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00") > res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z") > res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20166) Use XXX for ISO timezone instead of ZZ which is FastDateFormat specific in CSV/JSON time related options
[ https://issues.apache.org/jira/browse/SPARK-20166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20166. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17489 [https://github.com/apache/spark/pull/17489] > Use XXX for ISO timezone instead of ZZ which is FastDateFormat specific in > CSV/JSON time related options > > > Key: SPARK-20166 > URL: https://issues.apache.org/jira/browse/SPARK-20166 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Trivial > Fix For: 2.2.0 > > > We can use {{XXX}} format instead of {{ZZ}}. {{ZZ}} seems a > {{FastDateFormat}} specific Please see > https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#iso8601timezone > and > https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/FastDateFormat.html > {{ZZ}} supports "ISO 8601 extended format time zones" but it seems > {{FastDateFormat}} specific option. > It seems we better replace {{ZZ}} to {{XXX}} because they look use the same > strategy - > https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L930. > > I also checked the codes and manually debugged it for sure. It seems both > cases use the same pattern {code}( Z|(?:[+-]\\d{2}(?::)\\d{2})) {code}. > Note that this is a fix about documentation not the behaviour change because > {{ZZ}} seems invalid date format in {{SimpleDateFormat}} as documented in > {{DataFrameReader}}: > {quote} >* `timestampFormat` (default `-MM-dd'T'HH:mm:ss.SSSZZ`): sets the > string that >* indicates a timestamp format. Custom date formats follow the formats at >* `java.text.SimpleDateFormat`. This applies to timestamp type. 
> {quote} > {code} > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00") > res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z") > res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00") > java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00" > at java.text.DateFormat.parse(DateFormat.java:366) > ... 48 elided > scala> new > java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z") > java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z" > at java.text.DateFormat.parse(DateFormat.java:366) > ... 48 elided > {code} > {code} > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00") > res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z") > res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00") > res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017 > scala> > org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z") > res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
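The contrast shown in the scala-shell transcripts above can be reproduced with plain JDK classes: {{SimpleDateFormat}} understands {{XXX}} (ISO 8601 time zone) but rejects the ISO-style offset under {{ZZ}}, which only behaves that way in commons-lang's {{FastDateFormat}}. A small self-contained check:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class IsoTimezonePatterns {
    // Returns true when SimpleDateFormat can parse the timestamp with the pattern.
    static boolean canParse(String pattern, String timestamp) {
        try {
            new SimpleDateFormat(pattern).parse(timestamp);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String ts = "2017-03-21T00:00:00.000-11:00";
        // XXX accepts the ISO 8601 "-11:00" offset...
        System.out.println(canParse("yyyy-MM-dd'T'HH:mm:ss.SSSXXX", ts)); // true
        // ...while ZZ in SimpleDateFormat does not (it is FastDateFormat-specific).
        System.out.println(canParse("yyyy-MM-dd'T'HH:mm:ss.SSSZZ", ts));  // false
    }
}
```

This is why the fix is documentation-safe: {{XXX}} parses the same inputs under both formatters, whereas {{ZZ}} only works in {{FastDateFormat}}.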
[jira] [Commented] (SPARK-20180) Unlimited max pattern length in Prefix span
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953160#comment-15953160 ] Cyril de Vogelaere commented on SPARK-20180:
Can you not just set a very large max, like Int.MaxValue or similar? => Yes, I said that in the fourth paragraph of my last comment. A careful user could always set it to Int.MaxValue and never have problems in practice. Still, it doesn't change the fact that I advocate for that special value (0) as the default, since it would be nice if, on a first run and no matter the dataset, all solution patterns were found, even those longer than 10.
It's not normal for tests to run more than a couple hours. You need to see why. Is your test of unlimited max pattern stuck? => It's not my test per se, it's the dev/run-tests suite which we are asked to run before creating a pull request. I tested with my few changes and it ran for a day and a half. I'm re-running it now on the current state of the library, without my changes, and it doesn't seem faster ... for now at least ... So I'm pretty sure I didn't break anything there; for now the errors seem the same too, but I haven't taken a deep look at them.
> Unlimited max pattern length in Prefix span
> ---
>
> Key: SPARK-20180
> URL: https://issues.apache.org/jira/browse/SPARK-20180
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.1.0
> Reporter: Cyril de Vogelaere
> Priority: Minor
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> Right now, we need to use the .setMaxPatternLength() method to specify the maximum pattern length of a sequence. Any pattern longer than that won't be output.
> The current default maxPatternLength value is 10.
> This should be changed so that with input 0, all patterns of any length would be output. Additionally, the default value should be changed to 0, so that a new user could find all patterns in his dataset without looking at this parameter.
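The proposal boils down to a sentinel convention: treat 0 as "no limit". A hypothetical helper (the name {{effectiveMaxPatternLength}} is ours, not Spark's) sketches how such a parameter could be normalized internally, which also covers the "careful user sets Int.MaxValue" workaround discussed above:

```java
public class MaxPatternLength {
    // Hypothetical normalization: the proposed sentinel 0 maps to "unlimited"
    // (Integer.MAX_VALUE); any positive limit passes through unchanged.
    static int effectiveMaxPatternLength(int maxPatternLength) {
        if (maxPatternLength < 0) {
            throw new IllegalArgumentException("maxPatternLength must be >= 0");
        }
        return maxPatternLength == 0 ? Integer.MAX_VALUE : maxPatternLength;
    }

    public static void main(String[] args) {
        System.out.println(effectiveMaxPatternLength(0));  // 2147483647
        System.out.println(effectiveMaxPatternLength(10)); // 10
    }
}
```

With this convention the public setter keeps its signature; only the interpretation of 0 changes, so existing callers passing positive limits are unaffected.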
[jira] [Commented] (SPARK-15352) Topology aware block replication
[ https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953153#comment-15953153 ] Apache Spark commented on SPARK-15352: -- User 'lins05' has created a pull request for this issue: https://github.com/apache/spark/pull/17519
> Topology aware block replication
>
> Key: SPARK-15352
> URL: https://issues.apache.org/jira/browse/SPARK-15352
> Project: Spark
> Issue Type: New Feature
> Components: Block Manager, Mesos, Spark Core, YARN
> Reporter: Shubham Chopra
> Assignee: Shubham Chopra
>
> With cached RDDs, Spark can be used for online analytics where it is used to respond to online queries. But loss of RDD partitions due to node/executor failures can cause huge delays in such use cases as the data would have to be regenerated.
> Cached RDDs, even when using multiple replicas per block, are not currently resilient to node failures when multiple executors are started on the same node. Block replication currently chooses a peer at random, and this peer could also exist on the same host.
> This effort would add topology aware replication to Spark that can be enabled with pluggable strategies. For ease of development/review, this is being broken down into three major work-efforts:
> 1. Making peer selection for replication pluggable
> 2. Providing pluggable implementations for providing topology and topology aware replication
> 3. Pro-active replenishment of lost blocks
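The "peer on the same host" problem behind work-effort 1 can be illustrated without any Spark internals. This sketch is not Spark's actual replication-policy API; the {{Peer}} type and {{prioritize}} method are illustrative. A topology-aware strategy prefers peers on other hosts, so a single host failure cannot take out every replica:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TopologyAwarePeers {
    // Illustrative peer record: the host an executor runs on, and its id.
    static class Peer {
        final String host;
        final String executorId;
        Peer(String host, String executorId) { this.host = host; this.executorId = executorId; }
    }

    // Put off-host peers first; fall back to same-host peers only when no
    // off-host peer exists (random selection could pick a same-host peer).
    static List<Peer> prioritize(String localHost, List<Peer> peers) {
        List<Peer> offHost = new ArrayList<>();
        List<Peer> sameHost = new ArrayList<>();
        for (Peer p : peers) {
            (p.host.equals(localHost) ? sameHost : offHost).add(p);
        }
        offHost.addAll(sameHost);
        return offHost;
    }

    public static void main(String[] args) {
        List<Peer> peers = Arrays.asList(
            new Peer("host-a", "exec-1"),
            new Peer("host-b", "exec-2"),
            new Peer("host-a", "exec-3"));
        // Replicating from host-a: the host-b peer is preferred.
        System.out.println(prioritize("host-a", peers).get(0).executorId); // exec-2
    }
}
```

Making this ordering pluggable, as the issue proposes, would let a rack-aware or zone-aware strategy slot in the same way.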
[jira] [Assigned] (SPARK-15352) Topology aware block replication
[ https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15352:
Assignee: Shubham Chopra (was: Apache Spark)
[jira] [Assigned] (SPARK-15352) Topology aware block replication
[ https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15352:
Assignee: Apache Spark (was: Shubham Chopra)
[jira] [Assigned] (SPARK-19985) Some ML Models error when copy or do not set parent
[ https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19985: -- Assignee: Bryan Cutler
> Some ML Models error when copy or do not set parent
> ---
>
> Key: SPARK-19985
> URL: https://issues.apache.org/jira/browse/SPARK-19985
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Bryan Cutler
> Assignee: Bryan Cutler
> Fix For: 2.2.0
>
> Some ML Models fail when copied, due to not having a default constructor yet implementing {{copy}} with {{defaultCopy}}. Other cases do not properly set the parent when the model is copied. These models were missing the usual check for this in the test suites.
> Models with issues are:
> * RFormulaModel
> * MultilayerPerceptronClassificationModel
> * BucketedRandomProjectionLSHModel
> * MinHashLSH
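The {{defaultCopy}} failure mode is reflection at work: a default copy implementation reflectively instantiates the model class, which fails when the expected constructor is missing. A Spark-free sketch of that mechanism (the {{defaultCopy}} stand-in and the model classes here are illustrative, not Spark's actual code):

```java
import java.lang.reflect.Constructor;

public class DefaultCopyDemo {
    // A "model" exposing the constructor the reflective copy expects.
    static class GoodModel {
        final String uid;
        GoodModel(String uid) { this.uid = uid; }
    }

    // A "model" lacking that constructor: reflective copy cannot rebuild it.
    static class BadModel {
        final String uid;
        final double[] weights;
        BadModel(String uid, double[] weights) { this.uid = uid; this.weights = weights; }
    }

    // Illustrative stand-in for a reflection-based default copy: look up a
    // single-String (uid) constructor and instantiate a fresh object from it.
    static <T> T defaultCopy(Class<T> cls, String uid) throws ReflectiveOperationException {
        Constructor<T> ctor = cls.getDeclaredConstructor(String.class);
        return ctor.newInstance(uid);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(defaultCopy(GoodModel.class, "m1").uid); // m1
        try {
            defaultCopy(BadModel.class, "m2");
        } catch (NoSuchMethodException e) {
            System.out.println("no matching constructor"); // this branch is taken
        }
    }
}
```

This is why a model that routes {{copy}} through a reflective default needs either the matching constructor or a hand-written {{copy}}; the second class of bugs in the issue (parent not set on the copy) is a separate, non-reflective omission.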
[jira] [Resolved] (SPARK-19985) Some ML Models error when copy or do not set parent
[ https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19985.
Resolution: Fixed
Fix Version/s: 2.2.0
Issue resolved by pull request 17326 [https://github.com/apache/spark/pull/17326]