[jira] [Commented] (SPARK-18736) CreateMap allows non-unique keys

2017-02-02 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851057#comment-15851057
 ] 

Shuai Lin commented on SPARK-18736:
---

[~eyalfa] How is this going? I can work on this one if that's ok with you.

> CreateMap allows non-unique keys
> 
>
> Key: SPARK-18736
> URL: https://issues.apache.org/jira/browse/SPARK-18736
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eyal Farago
>  Labels: map, sql, types
>
> In Spark SQL, {{CreateMap}} does not enforce unique keys, i.e. it's possible to 
> create a map with two identical keys: 
> {noformat}
> CreateMap(Literal(1), Literal(11), Literal(1), Literal(12))
> {noformat}
> This does not behave like standard maps in common programming languages. A 
> proper behavior should be chosen:
> # first 'wins'
> # last 'wins'
> # runtime error.
> {{GetMapValue}} currently implements option #1. Even if this is the desired 
> behavior, {{CreateMap}} should return a map with unique keys.






[jira] [Commented] (SPARK-14098) Generate code that get a float/double value in each column from CachedBatch when DataFrame.cache() is called

2017-01-28 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843996#comment-15843996
 ] 

Shuai Lin commented on SPARK-14098:
---

[~kiszk] It seems the title/description of this ticket does not match what was 
done in https://github.com/apache/spark/pull/15219 . Should we update the 
title/description here?

> Generate code that get a float/double value in each column from CachedBatch 
> when DataFrame.cache() is called
> 
>
> Key: SPARK-14098
> URL: https://issues.apache.org/jira/browse/SPARK-14098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> When DataFrame.cache() is called, data is stored as column-oriented storage 
> in CachedBatch. The current Catalyst generates a Java program to get a value of 
> a column from an InternalRow that is translated from CachedBatch. This issue 
> generates Java code to get a value of a column from CachedBatch. While a 
> column for a cache may be compressed, this issue handles float and double 
> types that are never compressed. 
> Other primitive types, whose column may be compressed, will be addressed in 
> another entry.






[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-16 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824935#comment-15824935
 ] 

Shuai Lin commented on SPARK-19153:
---

Never mind. I can help review it instead.

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-15 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823552#comment-15823552
 ] 

Shuai Lin commented on SPARK-19153:
---

[~windpiger] I planned to send a PR today, only to find you already did that. 
May I suggest leaving a comment before starting to work on a ticket, so we 
don't step on each other's toes?

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-15 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823141#comment-15823141
 ] 

Shuai Lin commented on SPARK-19153:
---

bq. To clarify, we want this feature in DataFrameWriter and the official CREATE 
TABLE SQL statement, the legacy CREATE TABLE hive syntax is not our goal.

Thanks for the reply, and I agree with this. But TBH I'm not sure whether you 
think the summary I gave above is correct or not. Could you clarify that?

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-15 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823103#comment-15823103
 ] 

Shuai Lin commented on SPARK-19153:
---

I find it quite straightforward to remove the partitioned-by restriction for 
the {{create table t1 using hive partitioned by (c1,c2) as select ...}} CTAS 
statement.

But another problem comes up: the partition columns must be the rightmost ones 
in the schema, otherwise the schema we store in the table properties of the 
metastore (under the property key "spark.sql.sources.schema") would be 
inconsistent with the schema we read back from the hive client API.

The reason is that, when creating a hive table in the metastore, the schema and 
partition columns are disjoint sets (as required by the hive client API). And when 
we read it back, we append the partition columns to the end of the schema to 
get the catalyst schema, i.e.:
{code}
// HiveClientImpl.scala
val partCols = h.getPartCols.asScala.map(fromHiveColumn)
val schema = StructType(h.getCols.asScala.map(fromHiveColumn) ++ partCols)
{code}
This was not a problem before we had the unified "create table" syntax, because 
the old create-hive-table syntax requires specifying the normal columns and 
partition columns separately, e.g. {{create table t1 (id int, name string) 
partitioned by (dept string)}} .

Now that we can create a partitioned table using the hive format, e.g. {{create table 
t1 (id int, name string, dept string) using hive partitioned by (name)}}, the 
partition columns may no longer be the last ones, so I think we need to reorder 
the schema so that the partition columns come last. This is consistent with 
data source tables, e.g.

{code}
scala> sql("create table t1 (id int, name string, dept string) using parquet 
partitioned by (name)")
scala> spark.table("t1").schema.fields.map(_.name)
res44: Array[String] = Array(id, dept, name)
{code}

[~cloud_fan] Does this sound good to you?
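
For illustration, here is a minimal sketch (plain Scala, not the actual Spark code; 
{{reorderSchema}} is a hypothetical helper) of the reordering I have in mind:

{code}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Move the partition columns to the end of the schema while keeping the
// relative order of the remaining columns.
def reorderSchema(schema: StructType, partitionCols: Seq[String]): StructType = {
  val (partFields, dataFields) =
    schema.fields.partition(f => partitionCols.contains(f.name))
  StructType(dataFields ++ partFields)
}

val schema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)
  .add("dept", StringType)

// Returns Array(id, dept, name): the partition column "name" is moved to the end,
// matching the data source table behavior shown above.
reorderSchema(schema, Seq("name")).fields.map(_.name)
{code}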


> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-13 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822661#comment-15822661
 ] 

Shuai Lin commented on SPARK-19153:
---

I'm working on this ticket, thanks.

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-17101) Provide consistent format identifiers for TextFileFormat and ParquetFileFormat

2017-01-11 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818648#comment-15818648
 ] 

Shuai Lin commented on SPARK-17101:
---

Hasn't this issue already been resolved by 
https://github.com/apache/spark/pull/14680 ? cc [~rxin]
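
For reference, a minimal sketch (an assumption about the shape of the fix, not the 
actual patch) of how a file format can expose a readable identifier: the 
"Format:" field in the explain output quoted below prints the format object's 
{{toString}}, so overriding it is enough to show a short name like "Text" instead 
of {{TextFileFormat@262e2c8c}}:

{code}
// Hypothetical sketch: a file format with a short, stable string representation.
class MyTextFileFormat /* extends FileFormat with DataSourceRegister in Spark */ {
  def shortName(): String = "text"
  // Printed by FileScan in the "Format:" field of the explain output.
  override def toString: String = "Text"
}
{code}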

> Provide consistent format identifiers for TextFileFormat and ParquetFileFormat
> --
>
> Key: SPARK-17101
> URL: https://issues.apache.org/jira/browse/SPARK-17101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> Define the format identifier that is used in {{Optimized Logical Plan}} in 
> {{explain}} for the {{text}} file format.
> {code}
> scala> spark.read.text("people.csv").cache.explain(extended = true)
> ...
> == Optimized Logical Plan ==
> InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>+- *FileScan text [value#24] Batched: false, Format: 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> == Physical Plan ==
> InMemoryTableScan [value#24]
>+- InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
>  +- *FileScan text [value#24] Batched: false, Format: 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}
> When you {{explain}} the csv format you can see {{Format: CSV}}.
> {code}
> scala> spark.read.csv("people.csv").cache.explain(extended = true)
> == Parsed Logical Plan ==
> Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
> == Analyzed Logical Plan ==
> _c0: string, _c1: string, _c2: string, _c3: string
> Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv
> == Optimized Logical Plan ==
> InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>+- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: 
> CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_c0:string,_c1:string,_c2:string,_c3:string>
> == Physical Plan ==
> InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42]
>+- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>  +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, 
> Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_c0:string,_c1:string,_c2:string,_c3:string>
> {code}
> The custom format is defined for JSON, too.
> {code}
> scala> spark.read.json("people.csv").cache.explain(extended = true)
> == Parsed Logical Plan ==
> Relation[_corrupt_record#93] json
> == Analyzed Logical Plan ==
> _corrupt_record: string
> Relation[_corrupt_record#93] json
> == Optimized Logical Plan ==
> InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>+- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_corrupt_record:string>
> == Physical Plan ==
> InMemoryTableScan [_corrupt_record#93]
>+- InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>  +- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, 
> InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_corrupt_record:string>
> {code}






[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark

2017-01-07 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-19123:
--
Flags:   (was: Important)

> KeyProviderException when reading Azure Blobs from Apache Spark
> ---
>
> Key: SPARK-19123
> URL: https://issues.apache.org/jira/browse/SPARK-19123
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output, Java API
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster 
> version 3.5 with Hadoop version 2.7.3
>Reporter: Saulo Ricci
>Priority: Minor
>  Labels: newbie
>
> I created a Spark job that is intended to read a set of json files from an 
> Azure Blob container. I set the key and reference to my storage and I'm 
> reading the files as shown in the snippet below:
> {code:java}
> SparkSession
> sparkSession =
> SparkSession.builder().appName("Pipeline")
> .master("yarn")
> .config("fs.azure", 
> "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
> 
> .config("fs.azure.account.key..blob.core.windows.net","")
> .getOrCreate();
> Dataset txs = sparkSession.read().json("wasb://path_to_files");
> {code}
> The point is that I'm unfortunately getting an 
> `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from 
> the Azure storage. According to the trace shown below, it seems the header is 
> too long, but I'm still trying to figure out what exactly that means:
> {code:java}
> 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: 
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
>   at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
>   at 
> taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 

[jira] [Commented] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark

2017-01-07 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15808663#comment-15808663
 ] 

Shuai Lin commented on SPARK-19123:
---

IIUC {{KeyProviderException}} means the storage account key is not configured 
properly. Are you sure the way you specify the key is correct? Have you checked 
the Azure developer docs for it?

BTW I don't think this is a "critical" issue, so I changed the priority to "Minor".
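
For what it's worth, a minimal sketch (Scala, spark-shell style; the account name, 
container and environment variable below are placeholders, not values from this 
ticket) of one common way to pass the key via the Hadoop configuration:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Pipeline").getOrCreate()

// hadoop-azure expects the key under fs.azure.account.key.<account>.blob.core.windows.net
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.myaccount.blob.core.windows.net",
  sys.env("AZURE_STORAGE_KEY"))  // read the key from the environment instead of hard-coding it

val txs = spark.read.json("wasb://mycontainer@myaccount.blob.core.windows.net/path/to/files")
{code}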

> KeyProviderException when reading Azure Blobs from Apache Spark
> ---
>
> Key: SPARK-19123
> URL: https://issues.apache.org/jira/browse/SPARK-19123
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output, Java API
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster 
> version 3.5 with Hadoop version 2.7.3
>Reporter: Saulo Ricci
>Priority: Minor
>  Labels: newbie
>
> I created a Spark job that is intended to read a set of json files from an 
> Azure Blob container. I set the key and reference to my storage and I'm 
> reading the files as shown in the snippet below:
> {code:java}
> SparkSession
> sparkSession =
> SparkSession.builder().appName("Pipeline")
> .master("yarn")
> .config("fs.azure", 
> "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
> 
> .config("fs.azure.account.key..blob.core.windows.net","")
> .getOrCreate();
> Dataset txs = sparkSession.read().json("wasb://path_to_files");
> {code}
> The point is that I'm unfortunately getting an 
> `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from 
> the Azure storage. According to the trace shown below, it seems the header is 
> too long, but I'm still trying to figure out what exactly that means:
> {code:java}
> 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: 
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
>   at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
>   at 
> taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
>   at 

[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark

2017-01-07 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-19123:
--
Labels: newbie  (was: features newbie test)

> KeyProviderException when reading Azure Blobs from Apache Spark
> ---
>
> Key: SPARK-19123
> URL: https://issues.apache.org/jira/browse/SPARK-19123
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output, Java API
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster 
> version 3.5 with Hadoop version 2.7.3
>Reporter: Saulo Ricci
>Priority: Minor
>  Labels: newbie
>
> I created a Spark job that is intended to read a set of json files from an 
> Azure Blob container. I set the key and reference to my storage and I'm 
> reading the files as shown in the snippet below:
> {code:java}
> SparkSession
> sparkSession =
> SparkSession.builder().appName("Pipeline")
> .master("yarn")
> .config("fs.azure", 
> "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
> 
> .config("fs.azure.account.key..blob.core.windows.net","")
> .getOrCreate();
> Dataset txs = sparkSession.read().json("wasb://path_to_files");
> {code}
> The point is that I'm unfortunately getting an 
> `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from 
> the Azure storage. According to the trace shown below, it seems the header is 
> too long, but I'm still trying to figure out what exactly that means:
> {code:java}
> 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: 
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
>   at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
>   at 
> taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 

[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark

2017-01-07 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-19123:
--
Priority: Minor  (was: Critical)

> KeyProviderException when reading Azure Blobs from Apache Spark
> ---
>
> Key: SPARK-19123
> URL: https://issues.apache.org/jira/browse/SPARK-19123
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output, Java API
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster 
> version 3.5 with Hadoop version 2.7.3
>Reporter: Saulo Ricci
>Priority: Minor
>  Labels: features, newbie, test
>
> I created a Spark job that is intended to read a set of json files from an 
> Azure Blob container. I set the key and reference to my storage and I'm 
> reading the files as shown in the snippet below:
> {code:java}
> SparkSession
> sparkSession =
> SparkSession.builder().appName("Pipeline")
> .master("yarn")
> .config("fs.azure", 
> "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
> 
> .config("fs.azure.account.key..blob.core.windows.net","")
> .getOrCreate();
> Dataset txs = sparkSession.read().json("wasb://path_to_files");
> {code}
> The point is that I'm unfortunately getting an 
> `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from 
> the Azure storage. According to the trace shown below, it seems the header is 
> too long, but I'm still trying to figure out what exactly that means:
> {code:java}
> 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: 
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
>   at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
>   at 
> taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 

[jira] [Commented] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration

2016-12-19 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763089#comment-15763089
 ] 

Shuai Lin commented on SPARK-17755:
---

A (sort-of) similar problem for coarse-grained scheduler backends is reported 
in https://issues.apache.org/jira/browse/SPARK-18820 .

> Master may ask a worker to launch an executor before the worker actually got 
> the response of registration
> -
>
> Key: SPARK-17755
> URL: https://issues.apache.org/jira/browse/SPARK-17755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yin Huai
>Assignee: Shixiong Zhu
>
> I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
> memory, serialized, replicated}}. Its log shows that Spark master asked the 
> worker to launch an executor before the worker actually got the response of 
> registration. So, the master knew that the worker had been registered. But, 
> the worker did not know whether it itself had been registered. 
> {code}
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
> localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
> app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
> 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
> StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
> on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
> 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO 
> StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on 
> hostPort localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
> (spark://localhost:46460) attempted to launch executor.
> 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
> Successfully registered with master spark://localhost:46460
> {code}
> Then, it seems the worker did not launch any executor. 






[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-13 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747151#comment-15747151
 ] 

Shuai Lin commented on SPARK-18278:
---


bq. If I had to choose between maintaining a fork versus cleaning up the 
scheduler to make a public API, I would choose the latter in the interest of 
clarifying the relationship between the K8s effort and the mainline project, as 
well as for making the scheduler code cleaner in general. 

Adding support for pluggable scheduler backends in Spark is cool. AFAIK there 
are some custom scheduler backends for Spark, and they use forked 
versions of Spark due to the lack of pluggable scheduler backend support:

- [Two sigma's spark fork|https://github.com/twosigma/spark], which added 
scheduler support for their [Cook Scheduler|https://github.com/twosigma/Cook]
- IBM also has a custom "Spark Session Scheduler", [which they shared in last 
month's MesosCon 
Asia|https://mesosconasia2016.sched.com/event/8Tut/spark-session-scheduler-the-key-to-guaranteed-sla-of-spark-applications-for-multiple-users-on-mesos-yong-feng-ibm-canada-ltd]

bq. we could include the K8s scheduler in the Apache releases as an 
experimental feature, ignore its bugs and test failures for the next few 
releases (that is, problems in the K8s-related code should never block releases)

I'm afraid that doesn't sound like a good practice.


> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Commented] (SPARK-18820) Driver may send "LaunchTask" before executor receive "RegisteredExecutor"

2016-12-11 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15739970#comment-15739970
 ] 

Shuai Lin commented on SPARK-18820:
---

The driver first sends the {{RegisteredExecutor}} message and then, if there is a 
task scheduled to run on this executor, sends the {{LaunchTask}} message, both 
through the same underlying netty channel. So I think the order is guaranteed 
and the problem described here should never happen.

> Driver may send "LaunchTask" before executor receive "RegisteredExecutor"
> -
>
> Key: SPARK-18820
> URL: https://issues.apache.org/jira/browse/SPARK-18820
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.3
> Environment: spark-1.6.3
>Reporter: jin xing
>
> CoarseGrainedSchedulerBackend updates executorDataMap after receiving 
> "RegisterExecutor", so the task scheduler may assign tasks to this executor.
> If LaunchTask arrives at CoarseGrainedExecutorBackend before 
> RegisteredExecutor, it will result in a NullPointerException and the executor 
> backend will exit.
> Is it a bug? If so, can I make a PR? I think the driver should send "LaunchTask" 
> only after "RegisteredExecutor" has been received.






[jira] [Commented] (SPARK-18736) CreateMap allows non-unique keys

2016-12-06 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727398#comment-15727398
 ] 

Shuai Lin commented on SPARK-18736:
---

Ok, sounds good to me.

> CreateMap allows non-unique keys
> 
>
> Key: SPARK-18736
> URL: https://issues.apache.org/jira/browse/SPARK-18736
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eyal Farago
>  Labels: map, sql, types
>
> In Spark SQL, {{CreateMap}} does not enforce unique keys, i.e. it's possible to 
> create a map with two identical keys: 
> {noformat}
> CreateMap(Literal(1), Literal(11), Literal(1), Literal(12))
> {noformat}
> This does not behave like standard maps in common programming languages. A 
> proper behavior should be chosen:
> # first 'wins'
> # last 'wins'
> # runtime error.
> {{GetMapValue}} currently implements option #1. Even if this is the desired 
> behavior, {{CreateMap}} should return a map with unique keys.






[jira] [Commented] (SPARK-18736) CreateMap allows non-unique keys

2016-12-06 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726074#comment-15726074
 ] 

Shuai Lin commented on SPARK-18736:
---

If the keys are all literals, then we can detect and remove the duplicated keys 
during analysis.

But if there are non-literal keys, we can't detect this before the physical 
execution, e.g.:
{code}
spark.createDataFrame(Seq(
  (1, "aaa"),
  (2, "bbb"),
  (3, "ccc")
)).toDF("id", "name").registerTempTable("df")
sql("select map(name, id, 'aaa', -1) as m from df").show()
{code}

So I think we can do this in two places:

* When preparing the {{keys}} and {{values}} expressions, we can remove all 
duplicated literal keys. 
* When doing codegen, we can add logic to discard the duplicated keys if there 
are any (e.g. by tracking the keys in a set), as in the sketch below.

[~hvanhovell] Does this sound good to you?
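
A minimal sketch (plain Scala, not the actual codegen) of the "track the keys in a 
set" idea, keeping the first value seen for each key (option #1, matching what 
{{GetMapValue}} does today):

{code}
// Hypothetical helper: build deduplicated key/value sequences for the map.
def dedupKeys[K, V](keys: Seq[K], values: Seq[V]): (Seq[K], Seq[V]) = {
  val seen = scala.collection.mutable.LinkedHashMap.empty[K, V]
  keys.zip(values).foreach { case (k, v) =>
    // First key wins; later duplicates are discarded.
    if (!seen.contains(k)) seen(k) = v
  }
  (seen.keys.toSeq, seen.values.toSeq)
}

// dedupKeys(Seq(1, 1), Seq(11, 12)) == (Seq(1), Seq(11))
{code}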

> CreateMap allows non-unique keys
> 
>
> Key: SPARK-18736
> URL: https://issues.apache.org/jira/browse/SPARK-18736
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eyal Farago
>  Labels: map, sql, types
>
> In Spark SQL, {{CreateMap}} does not enforce unique keys, i.e. it's possible to 
> create a map with two identical keys: 
> {noformat}
> CreateMap(Literal(1), Literal(11), Literal(1), Literal(12))
> {noformat}
> This does not behave like standard maps in common programming languages. A 
> proper behavior should be chosen:
> # first 'wins'
> # last 'wins'
> # runtime error.
> {{GetMapValue}} currently implements option #1. Even if this is the desired 
> behavior, {{CreateMap}} should return a map with unique keys.






[jira] [Commented] (SPARK-18736) CreateMap allows non-unique keys

2016-12-06 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15725579#comment-15725579
 ] 

Shuai Lin commented on SPARK-18736:
---

I can work on this.

> CreateMap allows non-unique keys
> 
>
> Key: SPARK-18736
> URL: https://issues.apache.org/jira/browse/SPARK-18736
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eyal Farago
>  Labels: map, sql, types
>
> In Spark SQL, {{CreateMap}} does not enforce unique keys, i.e. it's possible to 
> create a map with two identical keys: 
> {noformat}
> CreateMap(Literal(1), Literal(11), Literal(1), Literal(12))
> {noformat}
> This does not behave like standard maps in common programming languages. A 
> proper behavior should be chosen:
> # first 'wins'
> # last 'wins'
> # runtime error.
> {{GetMapValue}} currently implements option #1. Even if this is the desired 
> behavior, {{CreateMap}} should return a map with unique keys.






[jira] [Updated] (SPARK-18652) Include the example data and third-party licenses in pyspark package

2016-11-30 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-18652:
--
Description: 
Since we already include the python examples in the pyspark package, we should 
include the example data with it as well.

We should also include the third-party licenses since we distribute their jars 
with the pyspark package.

  was:Since we already include the python examples in the pyspark package, we 
should include the example data with it as well.


> Include the example data and third-party licenses in pyspark package
> 
>
> Key: SPARK-18652
> URL: https://issues.apache.org/jira/browse/SPARK-18652
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Shuai Lin
>Priority: Minor
>
> Since we already include the python examples in the pyspark package, we 
> should include the example data with it as well.
> We should also include the third-party licenses since we distribute their 
> jars with the pyspark package.






[jira] [Updated] (SPARK-18652) Include the example data and third-party licenses in pyspark package

2016-11-30 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-18652:
--
Summary: Include the example data and third-party licenses in pyspark 
package  (was: Include the example data with the pyspark package)

> Include the example data and third-party licenses in pyspark package
> 
>
> Key: SPARK-18652
> URL: https://issues.apache.org/jira/browse/SPARK-18652
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Shuai Lin
>Priority: Minor
>
> Since we already include the python examples in the pyspark package, we 
> should include the example data with it as well.






[jira] [Created] (SPARK-18652) Include the example data with the pyspark package

2016-11-30 Thread Shuai Lin (JIRA)
Shuai Lin created SPARK-18652:
-

 Summary: Include the example data with the pyspark package
 Key: SPARK-18652
 URL: https://issues.apache.org/jira/browse/SPARK-18652
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Reporter: Shuai Lin
Priority: Minor


Since we already include the python examples in the pyspark package, we should 
include the example data with it as well.






[jira] [Updated] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used

2016-10-30 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-18171:
--
Description: In [[SPARK-4563]] we added the support for the driver to 
advertise a different hostname/ip ({{spark.driver.host}}) to the executors other 
than the hostname/ip the driver actually binds to 
({{spark.driver.bindAddress}}). But in the mesos webui's frameworks page, it 
still shows the driver's bind hostname/ip (though the web ui link is correct). 
We should fix it to make them consistent.  (was: In INF-4563 we added the 
support for the driver to advertise a different hostname/ip 
({{spark.driver.host}} to the executors other than the hostname/ip the driver 
actually binds to ({{spark.driver.bindAddress}}). But in the mesos webui's 
frameworks page, it still shows the driver's binds hostname/ip (though the web 
ui link is correct). We should fix it to make them consistent.)

> Show correct framework address in mesos master web ui when the advertised 
> address is used
> -
>
> Key: SPARK-18171
> URL: https://issues.apache.org/jira/browse/SPARK-18171
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Shuai Lin
>Priority: Minor
>
> In [[SPARK-4563]] we added the support for the driver to advertise a 
> different hostname/ip ({{spark.driver.host}}) to the executors other than the 
> hostname/ip the driver actually binds to ({{spark.driver.bindAddress}}). But 
> in the mesos webui's frameworks page, it still shows the driver's bind 
> hostname/ip (though the web ui link is correct). We should fix it to make 
> them consistent.






[jira] [Created] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used

2016-10-30 Thread Shuai Lin (JIRA)
Shuai Lin created SPARK-18171:
-

 Summary: Show correct framework address in mesos master web ui 
when the advertised address is used
 Key: SPARK-18171
 URL: https://issues.apache.org/jira/browse/SPARK-18171
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Shuai Lin
Priority: Minor


In INF-4563 we added the support for the driver to advertise a different 
hostname/ip ({{spark.driver.host}} to the executors other than the hostname/ip 
the driver actually binds to ({{spark.driver.bindAddress}}). But in the mesos 
webui's frameworks page, it still shows the driver's binds hostname/ip (though 
the web ui link is correct). We should fix it to make them consistent.






[jira] [Commented] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2016-10-29 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15618022#comment-15618022
 ] 

Shuai Lin commented on SPARK-4563:
--

To do that I think we need to add two extra options: 
{{spark.driver.advertisePort}} and {{spark.driver.blockManager.advertisePort}}, 
and pass them to the executors (instead of {{spark.driver.port}} and 
{{spark.driver.blockManager.port}}) when present.
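
A sketch of what the configuration could look like ({{spark.driver.advertisePort}} 
and {{spark.driver.blockManager.advertisePort}} are only proposed here and do not 
exist in Spark today; the host names and ports are made-up examples):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.bindAddress", "0.0.0.0")          // what the driver binds to
  .set("spark.driver.host", "driver.example.com")      // what is advertised to executors
  .set("spark.driver.port", "36000")
  .set("spark.driver.advertisePort", "46000")                   // proposed, not yet in Spark
  .set("spark.driver.blockManager.port", "36001")
  .set("spark.driver.blockManager.advertisePort", "46001")      // proposed, not yet in Spark
{code}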

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The Spark driver's bind IP and advertised IP are not configurable separately. 
> spark.driver.host is only the bind IP. SPARK_PUBLIC_DNS does not work for the 
> Spark driver. Allow an option to set the advertised ip/hostname.






[jira] [Created] (SPARK-17940) Typo in LAST function error message

2016-10-14 Thread Shuai Lin (JIRA)
Shuai Lin created SPARK-17940:
-

 Summary: Typo in LAST function error message
 Key: SPARK-17940
 URL: https://issues.apache.org/jira/browse/SPARK-17940
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Shuai Lin
Priority: Minor


https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40

{code}
  throw new AnalysisException("The second argument of First should be a 
boolean literal.")
{code} 

"First" should be "Last".

Also the usage string can be improved to match the FIRST function.
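
For clarity, the corrected line would presumably read:

{code}
throw new AnalysisException("The second argument of Last should be a boolean literal.")
{code}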






[jira] [Updated] (SPARK-17802) Lots of "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" In spark logs

2016-10-06 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-17802:
--
Description: 
SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to 
make spark applications more diagnosable in hadoop/hdfs logs. However, the 
{{CallerContext}} is only added since [hadoop 
2.8|https://issues.apache.org/jira/browse/HDFS-9184], which is not officially 
released yet. So each time {{utils.CallerContext.setCurrentContext()}} is called 
(e.g [when a task is 
created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]),
 a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
  error is logged, which pollutes the spark logs when there are lots of tasks.

We should improve this so it's only logged once.


  was:
SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to 
make spark applications more diagnosable in hadoop/hdfs logs. However, the 
{{CallerContext}} is only added since [hadoop 
2.8|https://issues.apache.org/jira/browse/HDFS-9184], which is not even 
officially released yet. So each time 
{{utils.CallerContext.setCurrentContext()}} is called (e.g [when a task is 
created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]),
 a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
  error is logged, which pollutes the spark logs when there are lots of tasks.

We should improve this so it's only logged once.



> Lots of "java.lang.ClassNotFoundException: 
> org.apache.hadoop.ipc.CallerContext" In spark logs
> -
>
> Key: SPARK-17802
> URL: https://issues.apache.org/jira/browse/SPARK-17802
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shuai Lin
>Priority: Minor
>
> SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis 
> to make spark applications more diagnosable in hadoop/hdfs logs. However, the 
> {{CallerContext}} is only added since [hadoop 
> 2.8|https://issues.apache.org/jira/browse/HDFS-9184], which is not officially 
> released yet. So each time {{utils.CallerContext.setCurrentContext()}} is 
> called (e.g [when a task is 
> created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]),
>  a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
>   error is logged, which pollutes the spark logs when there are lots of tasks.
> We should improve this so it's only logged once.
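
A minimal sketch (an assumption, not the actual Spark code) of the "only log it 
once" idea: guard the warning with an {{AtomicBoolean}} so repeated 
ClassNotFoundException hits do not flood the logs.

{code}
import java.util.concurrent.atomic.AtomicBoolean

object CallerContextWarning {
  private val warned = new AtomicBoolean(false)

  // compareAndSet returns true only for the first caller that flips the flag,
  // so the warning is emitted at most once per JVM.
  def warnOnce(log: String => Unit, e: Throwable): Unit = {
    if (warned.compareAndSet(false, true)) {
      log(s"Hadoop CallerContext is unavailable (it requires Hadoop 2.8+): $e")
    }
  }
}
{code}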






[jira] [Updated] (SPARK-17802) Lots of "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" In spark logs

2016-10-06 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-17802:
--
Description: 
SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to 
make spark applications more diagnosable in hadoop/hdfs logs. However, the 
{{CallerContext}} is only added since [hadoop 
2.8|https://issues.apache.org/jira/browse/HDFS-9184], which is not even 
officially released yet. So each time 
{{utils.CallerContext.setCurrentContext()}} is called (e.g [when a task is 
created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]),
 a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
  error is logged, which pollutes the spark logs when there are lots of tasks.

We should improve this so it's only logged once.


  was:
SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to 
make spark applications more diagnosable in hadoop/hdfs logs. However, the 
{{CallerContext}} is only added since [hadoop 
2.8|https://issues.apache.org/jira/browse/HDFS-9184]. So each time 
{{utils.CallerContext.setCurrentContext()}} is called (e.g [when a task is 
created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]),
 a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
  error is logged, which pollutes the spark logs when there are lots of tasks.

We should improve this so it's only logged once.



> Lots of "java.lang.ClassNotFoundException: 
> org.apache.hadoop.ipc.CallerContext" In spark logs
> -
>
> Key: SPARK-17802
> URL: https://issues.apache.org/jira/browse/SPARK-17802
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shuai Lin
>Priority: Minor
>
> SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis 
> to make spark applications more diagnosable in hadoop/hdfs logs. However, the 
> {{CallerContext}} is only added since [hadoop 
> 2.8|https://issues.apache.org/jira/browse/HDFS-9184], which is not even 
> officially released yet. So each time 
> {{utils.CallerContext.setCurrentContext()}} is called (e.g [when a task is 
> created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]),
>  a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
>   error is logged, which pollutes the spark logs when there are lots of tasks.
> We should improve this so it's only logged once.






[jira] [Updated] (SPARK-17802) Lots of "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" In spark logs

2016-10-06 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-17802:
--
Description: 
SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to 
make spark applications more diagnosable in hadoop/hdfs logs. However, the 
{{CallerContext}} is only added since [hadoop 
2.8|https://issues.apache.org/jira/browse/HDFS-9184]. So each time 
{{utils.CallerContext.setCurrentContext()}} is called (e.g. [when a task is 
created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]),
 a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
  error is logged, which pollutes the spark logs when there are lots of tasks.

We should improve this so it's only logged once.


  was:
SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to 
make spark applications more diagnosable in hadoop/hdfs logs. However, the 
{{CallerContext}} is only added since hadoop 2.8. So each time 
{{utils.CallerContext.setCurrentContext()}} is called (e.g when a task is 
created), a "java.lang.ClassNotFoundException: 
org.apache.hadoop.ipc.CallerContext"
  error is logged.

We should improve this so it's only logged once.



> Lots of "java.lang.ClassNotFoundException: 
> org.apache.hadoop.ipc.CallerContext" In spark logs
> -
>
> Key: SPARK-17802
> URL: https://issues.apache.org/jira/browse/SPARK-17802
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shuai Lin
>Priority: Minor
>
> SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis 
> to make spark applications more diagnosable in hadoop/hdfs logs. However, the 
> {{CallerContext}} is only added since [hadoop 
> 2.8|https://issues.apache.org/jira/browse/HDFS-9184]. So each time 
> {{utils.CallerContext.setCurrentContext()}} is called (e.g. [when a task is 
> created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]),
>  a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext"
>   error is logged, which pollutes the spark logs when there are lots of tasks.
> We should improve this so it's only logged once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17802) Lots of "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" In spark logs

2016-10-06 Thread Shuai Lin (JIRA)
Shuai Lin created SPARK-17802:
-

 Summary: Lots of "java.lang.ClassNotFoundException: 
org.apache.hadoop.ipc.CallerContext" In spark logs
 Key: SPARK-17802
 URL: https://issues.apache.org/jira/browse/SPARK-17802
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shuai Lin
Priority: Minor


SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to 
make spark applications more diagnosable in hadoop/hdfs logs. However, the 
{{CallerContext}} is only added since hadoop 2.8. So each time 
{{utils.CallerContext.setCurrentContext()}} is called (e.g. when a task is 
created), a "java.lang.ClassNotFoundException: 
org.apache.hadoop.ipc.CallerContext"
  error is logged.

We should improve this so it's only logged once.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17489) Improve filtering for bucketed tables

2016-09-09 Thread Shuai Lin (JIRA)
Shuai Lin created SPARK-17489:
-

 Summary: Improve filtering for bucketed tables
 Key: SPARK-17489
 URL: https://issues.apache.org/jira/browse/SPARK-17489
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Shuai Lin


The datasource API allows creating bucketed tables; can we optimize the query 
planning when there is a filter on the bucketed column?

For example:

{code}
select * from bucked_table where bucketed_col = "foo"
{code}

Given the above query, Spark should only load the bucket files corresponding to 
the bucket that the value "foo" hashes to.

But the current implementation loads all the files. Here is a small program 
to demonstrate.

{code}
# bin/spark-shell --master="local[2]"

case class Foo(name: String, age: Int)
spark.createDataFrame(Seq(
  Foo("aaa", 1),
  Foo("aaa", 2), 
  Foo("bbb", 3), 
  Foo("bbb", 4)))
  .write
  .format("json")
  .mode("overwrite")
  .bucketBy(2, "name")
  .saveAsTable("foo")

spark.sql("select * from foo where name = 'aaa'").show()

{code}

Then use sysdig to capture the file read events:

{code}
$ sudo sysdig -A -p "*%evt.time %evt.buffer" "fd.name contains spark-warehouse" 
and "evt.buffer contains bbb"  

05:36:59.430426611 
{\"name\":\"bbb\",\"age\":3}
{\"name\":\"bbb\",\"age\":4}
{code}

Sysdig shows that bucket files which obviously don't match the filter (name = 
"aaa") are also read by Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17414) Set type is not supported for creating data frames

2016-09-07 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472302#comment-15472302
 ] 

Shuai Lin commented on SPARK-17414:
---

So what type should {{Set}} be mapped to? {{ArrayType}}? That sounds sort of 
counter-intuitive.

> Set type is not supported for creating data frames
> --
>
> Key: SPARK-17414
> URL: https://issues.apache.org/jira/browse/SPARK-17414
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Emre Colak
>Priority: Minor
>
> For a case class that has a field of type Set, createDataFrame() method 
> throws an exception saying "Schema for type Set is not supported". Exception 
> is raised by the org.apache.spark.sql.catalyst.ScalaReflection class where 
> Array, Seq and Map types are supported but Set is not. It would be nice to 
> support Set here by default instead of having to write a custom Encoder.
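
Until Set is supported by schema inference, a minimal workaround sketch is to reshape the case class so the Set field becomes a Seq before building the DataFrame; the Event/EventRow classes below are made up for illustration:

{code}
import org.apache.spark.sql.SparkSession

// Hypothetical case classes: Event carries a Set, EventRow mirrors it with a Seq,
// which schema inference does support (it maps to ArrayType).
case class Event(id: Long, tags: Set[String])
case class EventRow(id: Long, tags: Seq[String])

object SetWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("set-workaround").master("local[2]").getOrCreate()
    import spark.implicits._

    val events = Seq(Event(1L, Set("a", "b")), Event(2L, Set("b")))
    // Convert Set -> Seq before creating the DataFrame.
    val df = events.map(e => EventRow(e.id, e.tags.toSeq)).toDF()
    df.printSchema()  // the tags column comes out as array<string>
    spark.stop()
  }
}
{code}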



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-09 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-16975:
--
Labels: parquet  (was: )

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --
>
> Key: SPARK-16975
> URL: https://issues.apache.org/jira/browse/SPARK-16975
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Ubuntu Linux 14.04
>Reporter: immerrr again
>  Labels: parquet
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated 
> by 1.6.2. 
> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data. It must be specified manually;'
> {code}
> The dataset is ~150G and partitioned by _locality_code column. None of the 
> partitions are empty. I have narrowed the failing dataset to the first 32 
> partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be 
> specified manually;'
> {code}
> Interestingly, it works OK if you remove any of the partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + 
> subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas for the first and the last 31 
> partitions of the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == 
> spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> That got me interested, so I tried this:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be 
> specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be 
> specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be 
> specified manually;'
> {code}
> If I read the first partition, save it in 2.0 and try to read in the same 
> manner, everything is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
> I originally posted this to the user mailing list, but given these findings 
> it clearly seems like a bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16822) Support latex in scaladoc with MathJax

2016-07-31 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401145#comment-15401145
 ] 

Shuai Lin commented on SPARK-16822:
---

I'm working on it and will post a PR soon.

> Support latex in scaladoc with MathJax
> --
>
> Key: SPARK-16822
> URL: https://issues.apache.org/jira/browse/SPARK-16822
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Shuai Lin
>Priority: Minor
>
> The scaladoc of some classes (mainly ml/mllib classes) includes math formulas, 
> but currently they render very poorly, e.g. [the doc of the LogisticGradient 
> class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient].
> We can improve this by including the MathJax JavaScript in the scaladoc pages, 
> much like what we do for the markdown docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16822) Support latex in scaladoc with MathJax

2016-07-31 Thread Shuai Lin (JIRA)
Shuai Lin created SPARK-16822:
-

 Summary: Support latex in scaladoc with MathJax
 Key: SPARK-16822
 URL: https://issues.apache.org/jira/browse/SPARK-16822
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Shuai Lin
Priority: Minor


The scaladoc of some classes (mainly ml/mllib classes) includes math formulas, 
but currently they render very poorly, e.g. [the doc of the LogisticGradient 
class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient].

We can improve this by including the MathJax JavaScript in the scaladoc pages, 
much like what we do for the markdown docs.
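
One possible approach, sketched below under the assumption that the generated scaladoc HTML can simply be post-processed to load MathJax from a CDN; the output directory and CDN URL are illustrative, and this is not necessarily how the eventual Spark change will do it:

{code}
import java.io.{File, PrintWriter}
import scala.io.Source

// Sketch only: inject a MathJax <script> tag into every generated scaladoc page.
object InjectMathJaxSketch {
  private val mathJaxTag =
    """<script type="text/javascript" async
      |  src="https://cdn.jsdelivr.net/npm/mathjax@2/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
      |</script>""".stripMargin

  private def htmlFiles(dir: File): Seq[File] = {
    val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    children.filter(_.isDirectory).flatMap(htmlFiles) ++
      children.filter(_.getName.endsWith(".html"))
  }

  def main(args: Array[String]): Unit = {
    // Root of the generated scaladoc output; this path is an assumption for illustration.
    val docRoot = new File(args.headOption.getOrElse("target/scala-2.11/api"))
    for (page <- htmlFiles(docRoot)) {
      val content = Source.fromFile(page, "UTF-8").mkString
      if (!content.contains("MathJax")) {
        val patched = content.replace("</head>", mathJaxTag + "\n</head>")
        val out = new PrintWriter(page, "UTF-8")
        try out.write(patched) finally out.close()
      }
    }
  }
}
{code}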



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16485) Additional fixes to Mllib 2.0 documentation

2016-07-13 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15376291#comment-15376291
 ] 

Shuai Lin commented on SPARK-16485:
---

[~josephkb] I fixed the math formatting of the {{MinMaxScaler}} doc in 
{{ml-features.md}} in the above PR; please take a look.

> Additional fixes to Mllib 2.0 documentation
> ---
>
> Key: SPARK-16485
> URL: https://issues.apache.org/jira/browse/SPARK-16485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Timothy Hunter
>Assignee: Joseph K. Bradley
> Fix For: 2.0.1, 2.1.0
>
>
> While reviewing the documentation of MLlib, I found some additional issues.
> Important issues that affect the binary signatures:
>  - GBTClassificationModel: all the setters should be overridden
>  - LogisticRegressionModel: setThreshold(s)
>  - RandomForestClassificationModel: all the setters should be overridden
>  - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but 
> most of the methods are private[ml] -> do we need to expose this class for 
> now?
> - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should 
> not be exposed
> - sqlDataTypes: name does not follow conventions. Do we need to expose it?
> Issues that involve only documentation:
> - Evaluator:
>   1. inconsistent doc between evaluate and isLargerBetter
> - MinMaxScaler: math rendering
> - GeneralizedLinearRegressionSummary: aic doc is incorrect
> The reference documentation that was used was:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16485) Additional fixes to Mllib 2.0 documentation

2016-07-13 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374558#comment-15374558
 ] 

Shuai Lin commented on SPARK-16485:
---

[~timhunter] [~josephkb] I'm new to the Spark community; may I create a sub-task 
for the doc-related changes mentioned in the description and work on it?

> Additional fixes to Mllib 2.0 documentation
> ---
>
> Key: SPARK-16485
> URL: https://issues.apache.org/jira/browse/SPARK-16485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Timothy Hunter
>
> While reviewing the documentation of MLlib, I found some additional issues.
> Important issues that affect the binary signatures:
>  - GBTClassificationModel: all the setters should be overridden
>  - LogisticRegressionModel: setThreshold(s)
>  - RandomForestClassificationModel: all the setters should be overridden
>  - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but 
> most of the methods are private[ml] -> do we need to expose this class for 
> now?
> - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should 
> not be exposed
> - sqlDataTypes: name does not follow conventions. Do we need to expose it?
> Issues that involve only documentation:
> - Evaluator:
>   1. inconsistent doc between evaluate and isLargerBetter
> - MinMaxScaler: math rendering
> - GeneralizedLinearRegressionSummary: aic doc is incorrect
> The reference documentation that was used was:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16490) Python mllib example for chi-squared feature selector

2016-07-11 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-16490:
--
Labels: starter  (was: )

> Python mllib example for chi-squared feature selector
> -
>
> Key: SPARK-16490
> URL: https://issues.apache.org/jira/browse/SPARK-16490
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, PySpark
>Reporter: Shuai Lin
>Priority: Minor
>  Labels: starter
>
> There are Java & Scala examples for {{ChiSqSelector}} in mllib, but the 
> corresponding Python example is missing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16490) Python mllib example for chi-squared feature selector

2016-07-11 Thread Shuai Lin (JIRA)
Shuai Lin created SPARK-16490:
-

 Summary: Python mllib example for chi-squared feature selector
 Key: SPARK-16490
 URL: https://issues.apache.org/jira/browse/SPARK-16490
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Reporter: Shuai Lin
Priority: Minor


There are Java & Scala examples for {{ChiSqSelector}} in mllib, but the 
corresponding Python example is missing.
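
For reference, a small self-contained sketch of the existing Scala (RDD-based mllib) usage that the missing Python example would mirror; the toy data here is made up:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object ChiSqSelectorExampleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chisq-selector").setMaster("local[2]"))

    // Toy labeled points; the chi-squared test treats feature values as categorical.
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 8.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 9.0)),
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 8.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 9.0))))

    // Keep the single most discriminative feature according to the chi-squared test.
    val selector = new ChiSqSelector(1)
    val model = selector.fit(data)
    val filtered = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))
    filtered.collect().foreach(println)

    sc.stop()
  }
}
{code}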



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11938) Expose numFeatures in all ML PredictionModel for PySpark

2016-07-07 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366000#comment-15366000
 ] 

Shuai Lin commented on SPARK-11938:
---

This ticket seems to be conflicting with [SPARK-15113].

> Expose numFeatures in all ML PredictionModel for PySpark
> 
>
> Key: SPARK-11938
> URL: https://issues.apache.org/jira/browse/SPARK-11938
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Kai Sasaki
>Priority: Minor
>
> SPARK-9715 provides support for numFeatures in all ML PredictionModel; we 
> should expose it on the Python side. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15009) PySpark CountVectorizerModel should be able to construct from vocabulary list

2016-07-03 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360714#comment-15360714
 ] 

Shuai Lin commented on SPARK-15009:
---

Hi [~bryanc], what's the status of this ticket? I can work on it if you're OK 
with that.

> PySpark CountVectorizerModel should be able to construct from vocabulary list
> -
>
> Key: SPARK-15009
> URL: https://issues.apache.org/jira/browse/SPARK-15009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> Like the Scala version, PySpark CountVectorizerModel should be able to 
> construct the model from a given vocabulary list.
> For example
> {noformat}
> cvm = CountVectorizerModel(["a", "b", "c"])
> {noformat}
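
For comparison, a minimal sketch of the Scala-side usage the Python API would mirror, building the model directly from a vocabulary array (the toy data is made up):

{code}
import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.sql.SparkSession

object CountVectorizerModelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cvm-from-vocab").master("local[2]").getOrCreate()

    val df = spark.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a"))
    )).toDF("id", "words")

    // Construct the model from a fixed vocabulary; no fitting step is needed.
    val cvm = new CountVectorizerModel(Array("a", "b", "c"))
      .setInputCol("words")
      .setOutputCol("features")

    cvm.transform(df).show(truncate = false)
    spark.stop()
  }
}
{code}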



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org