[jira] [Commented] (SPARK-18736) CreateMap allows non-unique keys
[ https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851057#comment-15851057 ] Shuai Lin commented on SPARK-18736: --- [~eyalfa] How is it going? I can work on this one if that's ok with you. > CreateMap allows non-unique keys > > > Key: SPARK-18736 > URL: https://issues.apache.org/jira/browse/SPARK-18736 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eyal Farago > Labels: map, sql, types > > In Spark SQL, {{CreateMap}} does not enforce unique keys, i.e. it's possible to > create a map with two identical keys: > {noformat} > CreateMap(Literal(1), Literal(11), Literal(1), Literal(12)) > {noformat} > This does not behave like standard maps in common programming languages. > A proper behavior should be chosen: > # first 'wins' > # last 'wins' > # runtime error. > {{GetMapValue}} currently implements option #1. Even if this is the desired > behavior, {{CreateMap}} should return a unique map. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
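The "first wins" semantics of option #1 can be sketched in plain Scala. This is a hypothetical helper, not Spark's actual implementation: it deduplicates the flattened (key, value) argument list of {{CreateMap}}, keeping the first occurrence of each key.

```scala
// Hypothetical sketch (not Spark's actual code): first-occurrence-wins
// deduplication of CreateMap's flattened (key, value) argument list,
// i.e. option #1 above.
object MapDedup {
  def firstWins[K, V](kvs: Seq[(K, V)]): Seq[(K, V)] = {
    // LinkedHashMap preserves insertion order, so the first key kept
    // determines the position and the value in the result.
    val seen = scala.collection.mutable.LinkedHashMap.empty[K, V]
    for ((k, v) <- kvs) if (!seen.contains(k)) seen += (k -> v)
    seen.toSeq
  }

  def main(args: Array[String]): Unit = {
    // CreateMap(Literal(1), Literal(11), Literal(1), Literal(12)) flattens to:
    val pairs = Seq(1 -> 11, 1 -> 12)
    println(firstWins(pairs)) // the first key wins, so only (1, 11) survives
  }
}
```

Whatever option is chosen, applying a pass like this inside {{CreateMap}} would make the resulting map unique instead of relying on {{GetMapValue}}'s lookup order.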
[jira] [Commented] (SPARK-14098) Generate code that get a float/double value in each column from CachedBatch when DataFrame.cache() is called
[ https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843996#comment-15843996 ] Shuai Lin commented on SPARK-14098: --- [~kiszk] It seems the title/description of this ticket no longer matches what was done in https://github.com/apache/spark/pull/15219 . Should we update the title/description here? > Generate code that get a float/double value in each column from CachedBatch > when DataFrame.cache() is called > > > Key: SPARK-14098 > URL: https://issues.apache.org/jira/browse/SPARK-14098 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > When DataFrame.cache() is called, data is stored as column-oriented storage > in CachedBatch. The current Catalyst generates a Java program to get a value of > a column from an InternalRow that is translated from CachedBatch. This issue > generates Java code to get a value of a column from CachedBatch. While a > column for a cache may be compressed, this issue handles float and double > types that are never compressed. > Other primitive types, whose column may be compressed, will be addressed in > another entry. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table
[ https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824935#comment-15824935 ] Shuai Lin commented on SPARK-19153: --- Never mind. I can help review it instead. > DataFrameWriter.saveAsTable should work with hive format to create > partitioned table > > > Key: SPARK-19153 > URL: https://issues.apache.org/jira/browse/SPARK-19153 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table
[ https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823552#comment-15823552 ] Shuai Lin commented on SPARK-19153: --- [~windpiger] I planned to send a PR today, only to find you had already done that. May I suggest leaving a comment before starting to work on a ticket, so we don't step on each other's toes? > DataFrameWriter.saveAsTable should work with hive format to create > partitioned table > > > Key: SPARK-19153 > URL: https://issues.apache.org/jira/browse/SPARK-19153 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table
[ https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823141#comment-15823141 ] Shuai Lin commented on SPARK-19153: --- bq. To clarify, we want this feature in DataFrameWriter and the official CREATE TABLE SQL statement, the legacy CREATE TABLE hive syntax is not our goal. Thanks for the reply, and I agree with this. But TBH I'm not sure whether you think the summary I gave above is correct or not. Could you clarify? > DataFrameWriter.saveAsTable should work with hive format to create > partitioned table > > > Key: SPARK-19153 > URL: https://issues.apache.org/jira/browse/SPARK-19153 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table
[ https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823103#comment-15823103 ] Shuai Lin commented on SPARK-19153: --- I find it's quite straightforward to remove the partitioned-by restriction for the {{create table t1 using hive partitioned by (c1,c2) as select ...}} CTAS statement. But another problem comes up: the partition columns must be the rightmost columns of the schema, otherwise the schema we store in the metastore table properties (under the property key "spark.sql.sources.schema") would be inconsistent with the schema we read back from the hive client api. The reason is, when creating a hive table in the metastore, the schema and partition columns are disjoint sets (as required by the hive client api). And when reading it back, we append the partition columns to the end of the schema to get the catalyst schema, i.e.: {code} // HiveClientImpl.scala val partCols = h.getPartCols.asScala.map(fromHiveColumn) val schema = StructType(h.getCols.asScala.map(fromHiveColumn) ++ partCols) {code} This was not a problem before we had the unified "create table" syntax, because in the old create hive table syntax we had to specify the normal columns and partition columns separately, e.g. {{create table t1 (id int, name string) partitioned by (dept string)}} . Now that we can create partitioned tables using hive format, e.g. {{create table t1 (id int, name string, dept string) using hive partitioned by (name)}}, the partition columns may not be the last columns, so I think we need to reorder the schema so that the partition columns come last. This is consistent with data source tables, e.g. {code} scala> sql("create table t1 (id int, name string, dept string) using parquet partitioned by (name)") scala> spark.table("t1").schema.fields.map(_.name) res44: Array[String] = Array(id, dept, name) {code} [~cloud_fan] Does this sound good to you? 
> DataFrameWriter.saveAsTable should work with hive format to create > partitioned table > > > Key: SPARK-19153 > URL: https://issues.apache.org/jira/browse/SPARK-19153 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
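The reordering described in the comment above can be sketched as follows. This is a simplified model with hypothetical names, not Spark's actual {{StructType}} machinery: partition columns are moved to the end of the schema so that "normal columns ++ partition columns" matches what the hive client api gives back when the table is read.

```scala
// Simplified sketch with hypothetical names (not Spark's real classes):
// move the partition columns to the end of the schema, so the stored
// schema stays consistent with the one reconstructed from the metastore.
object SchemaReorder {
  case class Field(name: String, dataType: String)

  def partitionColsLast(schema: Seq[Field], partCols: Seq[String]): Seq[Field] = {
    val (parts, normal) = schema.partition(f => partCols.contains(f.name))
    // keep the order declared in PARTITIONED BY, not the schema order
    normal ++ partCols.flatMap(c => parts.find(_.name == c))
  }

  def main(args: Array[String]): Unit = {
    // create table t1 (id int, name string, dept string) ... partitioned by (name)
    val schema = Seq(Field("id", "int"), Field("name", "string"), Field("dept", "string"))
    println(partitionColsLast(schema, Seq("name")).map(_.name)) // List(id, dept, name)
  }
}
```

The output order (id, dept, name) matches the data source table behavior shown in the scala shell example in the comment.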
[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table
[ https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822661#comment-15822661 ] Shuai Lin commented on SPARK-19153: --- I'm working on this ticket, thanks. > DataFrameWriter.saveAsTable should work with hive format to create > partitioned table > > > Key: SPARK-19153 > URL: https://issues.apache.org/jira/browse/SPARK-19153 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17101) Provide consistent format identifiers for TextFileFormat and ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-17101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15818648#comment-15818648 ] Shuai Lin commented on SPARK-17101: --- Seems this issue has already been resolved by https://github.com/apache/spark/pull/14680 ? cc [~rxin] > Provide consistent format identifiers for TextFileFormat and ParquetFileFormat > -- > > Key: SPARK-17101 > URL: https://issues.apache.org/jira/browse/SPARK-17101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jacek Laskowski >Priority: Trivial > > Define the format identifier that is used in {{Optimized Logical Plan}} in > {{explain}} for {{text}} file format. > {code} > scala> spark.read.text("people.csv").cache.explain(extended = true) > ... > == Optimized Logical Plan == > InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) >+- *FileScan text [value#24] Batched: false, Format: > org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > == Physical Plan == > InMemoryTableScan [value#24] >+- InMemoryRelation [value#24], true, 1, StorageLevel(disk, memory, > deserialized, 1 replicas) > +- *FileScan text [value#24] Batched: false, Format: > org.apache.spark.sql.execution.datasources.text.TextFileFormat@262e2c8c, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} > When you {{explain}} csv format you can see {{Format: CSV}}. 
> {code} > scala> spark.read.csv("people.csv").cache.explain(extended = true) > == Parsed Logical Plan == > Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv > == Analyzed Logical Plan == > _c0: string, _c1: string, _c2: string, _c3: string > Relation[_c0#39,_c1#40,_c2#41,_c3#42] csv > == Optimized Logical Plan == > InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, > StorageLevel(disk, memory, deserialized, 1 replicas) >+- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, Format: > CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<_c0:string,_c1:string,_c2:string,_c3:string> > == Physical Plan == > InMemoryTableScan [_c0#39, _c1#40, _c2#41, _c3#42] >+- InMemoryRelation [_c0#39, _c1#40, _c2#41, _c3#42], true, 1, > StorageLevel(disk, memory, deserialized, 1 replicas) > +- *FileScan csv [_c0#39,_c1#40,_c2#41,_c3#42] Batched: false, > Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct<_c0:string,_c1:string,_c2:string,_c3:string> > {code} > The custom format is defined for JSON, too. 
> {code} > scala> spark.read.json("people.csv").cache.explain(extended = true) > == Parsed Logical Plan == > Relation[_corrupt_record#93] json > == Analyzed Logical Plan == > _corrupt_record: string > Relation[_corrupt_record#93] json > == Optimized Logical Plan == > InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) >+- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_corrupt_record:string> > == Physical Plan == > InMemoryTableScan [_corrupt_record#93] >+- InMemoryRelation [_corrupt_record#93], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > +- *FileScan json [_corrupt_record#93] Batched: false, Format: JSON, > InputPaths: file:/Users/jacek/dev/oss/spark/people.csv, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_corrupt_record:string> > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
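One way to get a stable identifier like {{Format: CSV}} instead of the {{TextFileFormat@262e2c8c}} rendering is for the file format to override its string representation. The sketch below only illustrates that idea with hypothetical trait and object names; it is an assumption about the approach, which may differ from what the linked PR actually does.

```scala
// Sketch only: a file format that renders a stable short name in explain
// output instead of the default ClassName@hashcode toString. These names
// are hypothetical, not Spark's real FileFormat classes.
trait FileFormatLike {
  def shortName: String
  // Without this override, string interpolation of the object would fall
  // back to the default "ClassName@hashcode" form seen in the plan above.
  override def toString: String = shortName.capitalize
}

object TextFileFormatLike extends FileFormatLike {
  override def shortName: String = "text"
}

object ExplainDemo {
  def main(args: Array[String]): Unit = {
    println(s"Format: $TextFileFormatLike") // Format: Text
  }
}
```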
[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated SPARK-19123: -- Flags: (was: Important) > KeyProviderException when reading Azure Blobs from Apache Spark > --- > > Key: SPARK-19123 > URL: https://issues.apache.org/jira/browse/SPARK-19123 > Project: Spark > Issue Type: Question > Components: Input/Output, Java API >Affects Versions: 2.0.0 > Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster > version 3.5 with Hadoop version 2.7.3 >Reporter: Saulo Ricci >Priority: Minor > Labels: newbie > > I created a Spark job and it's intended to read a set of json files from a > Azure Blob container. I set the key and reference to my storage and I'm > reading the files as showed in the snippet bellow: > {code:java} > SparkSession > sparkSession = > SparkSession.builder().appName("Pipeline") > .master("yarn") > .config("fs.azure", > "org.apache.hadoop.fs.azure.NativeAzureFileSystem") > > .config("fs.azure.account.key..blob.core.windows.net","") > .getOrCreate(); > Dataset txs = sparkSession.read().json("wasb://path_to_files"); > {code} > The point is that I'm unfortunately getting a > `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from > the azure storage. 
According to the trace showed bellow it seems the header > too long but still trying to figure out what exactly that means: > {code:java} > 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: > org.apache.hadoop.fs.azure.AzureException: > org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException > exitCode=2: Error reading S/MIME message > 140473279682200:error:0D07207B:asn1 encoding > routines:ASN1_get_object:header too long:asn1_lib.c:157: > 140473279682200:error:0D0D106E:asn1 encoding > routines:B64_READ_ASN1:decode error:asn_mime.c:192: > 140473279682200:error:0D0D40CB:asn1 encoding > routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517: > org.apache.hadoop.fs.azure.AzureException: > org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException > exitCode=2: Error reading S/MIME message > 140473279682200:error:0D07207B:asn1 encoding > routines:ASN1_get_object:header too long:asn1_lib.c:157: > 140473279682200:error:0D0D106E:asn1 encoding > routines:B64_READ_ASN1:decode error:asn_mime.c:192: > 140473279682200:error:0D0D40CB:asn1 encoding > routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517: > at > org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953) > at > org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) > at > 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249) > at > taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at
[jira] [Commented] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15808663#comment-15808663 ] Shuai Lin commented on SPARK-19123: --- IIUC {{KeyProviderException}} means the storage account key is not configured properly. Are you sure the way you specify the key is correct? Have you checked the azure developer docs for it? BTW I don't think this is an "critical issue", so I changed it to "minor". > KeyProviderException when reading Azure Blobs from Apache Spark > --- > > Key: SPARK-19123 > URL: https://issues.apache.org/jira/browse/SPARK-19123 > Project: Spark > Issue Type: Question > Components: Input/Output, Java API >Affects Versions: 2.0.0 > Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster > version 3.5 with Hadoop version 2.7.3 >Reporter: Saulo Ricci >Priority: Minor > Labels: newbie > > I created a Spark job and it's intended to read a set of json files from a > Azure Blob container. I set the key and reference to my storage and I'm > reading the files as showed in the snippet bellow: > {code:java} > SparkSession > sparkSession = > SparkSession.builder().appName("Pipeline") > .master("yarn") > .config("fs.azure", > "org.apache.hadoop.fs.azure.NativeAzureFileSystem") > > .config("fs.azure.account.key..blob.core.windows.net","") > .getOrCreate(); > Dataset txs = sparkSession.read().json("wasb://path_to_files"); > {code} > The point is that I'm unfortunately getting a > `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from > the azure storage. 
According to the trace showed bellow it seems the header > too long but still trying to figure out what exactly that means: > {code:java} > 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: > org.apache.hadoop.fs.azure.AzureException: > org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException > exitCode=2: Error reading S/MIME message > 140473279682200:error:0D07207B:asn1 encoding > routines:ASN1_get_object:header too long:asn1_lib.c:157: > 140473279682200:error:0D0D106E:asn1 encoding > routines:B64_READ_ASN1:decode error:asn_mime.c:192: > 140473279682200:error:0D0D40CB:asn1 encoding > routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517: > org.apache.hadoop.fs.azure.AzureException: > org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException > exitCode=2: Error reading S/MIME message > 140473279682200:error:0D07207B:asn1 encoding > routines:ASN1_get_object:header too long:asn1_lib.c:157: > 140473279682200:error:0D0D106E:asn1 encoding > routines:B64_READ_ASN1:decode error:asn_mime.c:192: > 140473279682200:error:0D0D40CB:asn1 encoding > routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517: > at > org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953) > at > org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) > at > 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249) > at > taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35) > at
[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated SPARK-19123: -- Labels: newbie (was: features newbie test) > KeyProviderException when reading Azure Blobs from Apache Spark > --- > > Key: SPARK-19123 > URL: https://issues.apache.org/jira/browse/SPARK-19123 > Project: Spark > Issue Type: Question > Components: Input/Output, Java API >Affects Versions: 2.0.0 > Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster > version 3.5 with Hadoop version 2.7.3 >Reporter: Saulo Ricci >Priority: Minor > Labels: newbie > > I created a Spark job and it's intended to read a set of json files from a > Azure Blob container. I set the key and reference to my storage and I'm > reading the files as showed in the snippet bellow: > {code:java} > SparkSession > sparkSession = > SparkSession.builder().appName("Pipeline") > .master("yarn") > .config("fs.azure", > "org.apache.hadoop.fs.azure.NativeAzureFileSystem") > > .config("fs.azure.account.key..blob.core.windows.net","") > .getOrCreate(); > Dataset txs = sparkSession.read().json("wasb://path_to_files"); > {code} > The point is that I'm unfortunately getting a > `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from > the azure storage. 
According to the trace showed bellow it seems the header > too long but still trying to figure out what exactly that means: > {code:java} > 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: > org.apache.hadoop.fs.azure.AzureException: > org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException > exitCode=2: Error reading S/MIME message > 140473279682200:error:0D07207B:asn1 encoding > routines:ASN1_get_object:header too long:asn1_lib.c:157: > 140473279682200:error:0D0D106E:asn1 encoding > routines:B64_READ_ASN1:decode error:asn_mime.c:192: > 140473279682200:error:0D0D40CB:asn1 encoding > routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517: > org.apache.hadoop.fs.azure.AzureException: > org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException > exitCode=2: Error reading S/MIME message > 140473279682200:error:0D07207B:asn1 encoding > routines:ASN1_get_object:header too long:asn1_lib.c:157: > 140473279682200:error:0D0D106E:asn1 encoding > routines:B64_READ_ASN1:decode error:asn_mime.c:192: > 140473279682200:error:0D0D40CB:asn1 encoding > routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517: > at > org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953) > at > org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) > at > 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249) > at > taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at
[jira] [Updated] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated SPARK-19123: -- Priority: Minor (was: Critical) > KeyProviderException when reading Azure Blobs from Apache Spark > --- > > Key: SPARK-19123 > URL: https://issues.apache.org/jira/browse/SPARK-19123 > Project: Spark > Issue Type: Question > Components: Input/Output, Java API >Affects Versions: 2.0.0 > Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster > version 3.5 with Hadoop version 2.7.3 >Reporter: Saulo Ricci >Priority: Minor > Labels: features, newbie, test > > I created a Spark job and it's intended to read a set of json files from a > Azure Blob container. I set the key and reference to my storage and I'm > reading the files as showed in the snippet bellow: > {code:java} > SparkSession > sparkSession = > SparkSession.builder().appName("Pipeline") > .master("yarn") > .config("fs.azure", > "org.apache.hadoop.fs.azure.NativeAzureFileSystem") > > .config("fs.azure.account.key..blob.core.windows.net","") > .getOrCreate(); > Dataset txs = sparkSession.read().json("wasb://path_to_files"); > {code} > The point is that I'm unfortunately getting a > `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from > the azure storage. 
According to the trace showed bellow it seems the header > too long but still trying to figure out what exactly that means: > {code:java} > 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: > org.apache.hadoop.fs.azure.AzureException: > org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException > exitCode=2: Error reading S/MIME message > 140473279682200:error:0D07207B:asn1 encoding > routines:ASN1_get_object:header too long:asn1_lib.c:157: > 140473279682200:error:0D0D106E:asn1 encoding > routines:B64_READ_ASN1:decode error:asn_mime.c:192: > 140473279682200:error:0D0D40CB:asn1 encoding > routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517: > org.apache.hadoop.fs.azure.AzureException: > org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException > exitCode=2: Error reading S/MIME message > 140473279682200:error:0D07207B:asn1 encoding > routines:ASN1_get_object:header too long:asn1_lib.c:157: > 140473279682200:error:0D0D106E:asn1 encoding > routines:B64_READ_ASN1:decode error:asn_mime.c:192: > 140473279682200:error:0D0D40CB:asn1 encoding > routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517: > at > org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953) > at > org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) > at > 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249) > at > taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at
[jira] [Commented] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration
[ https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763089#comment-15763089 ] Shuai Lin commented on SPARK-17755: --- A (sort-of) similar problem for coarse grained scheduler backends is reported in https://issues.apache.org/jira/browse/SPARK-18820 . > Master may ask a worker to launch an executor before the worker actually got > the response of registration > - > > Key: SPARK-17755 > URL: https://issues.apache.org/jira/browse/SPARK-17755 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yin Huai >Assignee: Shixiong Zhu > > I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in > memory, serialized, replicated}}. Its log shows that Spark master asked the > worker to launch an executor before the worker actually got the response of > registration. So, the master knew that the worker had been registered. But, > the worker did not know if it self had been registered. > {code} > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker > localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor > app-20160930145353-/1 on worker worker-20160930145353-localhost-38262 > 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO > StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 > on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores > 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO > StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on > hostPort localhost:38262 with 1 cores, 1024.0 MB RAM > 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master > (spark://localhost:46460) attempted to launch executor. > 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: > Successfully registered with master spark://localhost:46460 > {code} > Then, seems the worker did not launch any executor. 
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747151#comment-15747151 ] Shuai Lin commented on SPARK-18278: --- bq. If I had to choose between maintaining a fork versus cleaning up the scheduler to make a public API, I would choose the latter in the interest of clarifying the relationship between the K8s effort and the mainline project, as well as for making the scheduler code cleaner in general. Adding support for a pluggable scheduler backend in Spark is cool. AFAIK there are some custom scheduler backends for Spark, and they are using forked versions of Spark due to the lack of pluggable scheduler backend support: - [Two Sigma's Spark fork|https://github.com/twosigma/spark], which added scheduler support for their [Cook Scheduler|https://github.com/twosigma/Cook] - IBM also has a custom "Spark Session Scheduler", [which they shared in last month's MesosCon Asia|https://mesosconasia2016.sched.com/event/8Tut/spark-session-scheduler-the-key-to-guaranteed-sla-of-spark-applications-for-multiple-users-on-mesos-yong-feng-ibm-canada-ltd] bq. we could include the K8s scheduler in the Apache releases as an experimental feature, ignore its bugs and test failures for the next few releases (that is, problems in the K8s-related code should never block releases) I'm afraid that doesn't sound like good practice. > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods.
[jira] [Commented] (SPARK-18820) Driver may send "LaunchTask" before executor receive "RegisteredExecutor"
[ https://issues.apache.org/jira/browse/SPARK-18820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15739970#comment-15739970 ] Shuai Lin commented on SPARK-18820: --- The driver first sends the {{RegisteredExecutor}} message and then, if there is a task scheduled to run on this executor, sends the {{LaunchTask}} message, both through the same underlying Netty channel. So I think the order is guaranteed, and the problem described would never happen. > Driver may send "LaunchTask" before executor receive "RegisteredExecutor" > - > > Key: SPARK-18820 > URL: https://issues.apache.org/jira/browse/SPARK-18820 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.3 > Environment: spark-1.6.3 >Reporter: jin xing > > CoarseGrainedSchedulerBackend will update executorDataMap after receiving > "RegisterExecutor", thus the task scheduler may assign tasks to this executor; > If LaunchTask arrives at CoarseGrainedExecutorBackend before > RegisteredExecutor, it will result in a NullPointerException and the executor > backend will exit; > Is it a bug? If so can I make a PR? I think the driver should send "LaunchTask" > after "RegisteredExecutor" is already received.
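The ordering claim above can be illustrated with a toy model in plain Scala (this is not Spark's actual Netty transport code, just a queue standing in for a single ordered channel): messages pushed through one FIFO channel are delivered in send order.

```scala
import scala.collection.mutable

// Toy model of a single ordered channel between driver and executor.
// Not Spark code: a FIFO queue stands in for the underlying Netty channel.
object ChannelOrder {
  private val channel = mutable.Queue[String]()

  def send(msg: String): Unit = channel.enqueue(msg)

  // Deliver everything in FIFO order, i.e. the order the driver sent it.
  def drain(): List[String] = {
    val out = channel.toList
    channel.clear()
    out
  }
}
```

Since {{RegisteredExecutor}} is enqueued before {{LaunchTask}} on the same channel, it is also delivered first, which is the heart of the argument above.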
[jira] [Commented] (SPARK-18736) CreateMap allows non-unique keys
[ https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727398#comment-15727398 ] Shuai Lin commented on SPARK-18736: --- Ok, sounds good to me. > CreateMap allows non-unique keys > > > Key: SPARK-18736 > URL: https://issues.apache.org/jira/browse/SPARK-18736 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eyal Farago > Labels: map, sql, types > > Spark-Sql, {{CreateMap}} does not enforce unique keys, i.e. it's possible to > create a map with two identical keys: > {noformat} > CreateMap(Literal(1), Literal(11), Literal(1), Literal(12)) > {noformat} > This does not behave like standard maps in common programming languages. > proper behavior should be chosen: > # first 'wins' > # last 'wins' > # runtime error. > {{GetMapValue}} currently implements option #1. Even if this is the desired > behavior {{CreateMap}} should return a unique map. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18736) CreateMap allows non-unique keys
[ https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726074#comment-15726074 ] Shuai Lin commented on SPARK-18736: --- If the keys are all literals, then we can detect and remove the duplicated keys during analysis. But if there are non-literal keys, we can't detect this before the physical execution, e.g.: {code} spark.createDataFrame( Seq( (1, "aaa"), (2, "bbb"), (3, "ccc") )).toDF("id", "name").registerTempTable("df") sql("select map(name, id, 'aaa', -1) as m from df").show() {code} So I think we can do this in two places: * When preparing the {{keys}} and {{values}} expressions, we can remove all duplicated literal keys. * When doing codegen, we can add logic to discard the duplicated keys if there are any (e.g. by tracking the keys in a set) [~hvanhovell] Does that sound good? > CreateMap allows non-unique keys > > > Key: SPARK-18736 > URL: https://issues.apache.org/jira/browse/SPARK-18736 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eyal Farago > Labels: map, sql, types > > Spark-Sql, {{CreateMap}} does not enforce unique keys, i.e. it's possible to > create a map with two identical keys: > {noformat} > CreateMap(Literal(1), Literal(11), Literal(1), Literal(12)) > {noformat} > This does not behave like standard maps in common programming languages. > proper behavior should be chosen: > # first 'wins' > # last 'wins' > # runtime error. > {{GetMapValue}} currently implements option #1. Even if this is the desired > behavior {{CreateMap}} should return a unique map.
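The key deduplication proposed above can be sketched in plain Scala. This is illustrative only — not the actual {{CreateMap}} code — and it assumes last-wins semantics, which is one of the three options listed in the issue:

```scala
import scala.collection.mutable

// Hypothetical sketch of deduplicating CreateMap's interleaved key/value
// pairs, with last-wins semantics for duplicated keys.
object MapDedup {
  def dedupKeys[K, V](pairs: Seq[(K, V)]): List[(K, V)] = {
    // LinkedHashMap keeps first-seen key order; a later put for the same
    // key overwrites the value, so the last occurrence wins.
    val m = mutable.LinkedHashMap[K, V]()
    pairs.foreach { case (k, v) => m(k) = v }
    m.toList
  }
}
```

For the example in the issue, {{dedupKeys(Seq(1 -> 11, 1 -> 12))}} keeps only {{1 -> 12}}.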
[jira] [Commented] (SPARK-18736) CreateMap allows non-unique keys
[ https://issues.apache.org/jira/browse/SPARK-18736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15725579#comment-15725579 ] Shuai Lin commented on SPARK-18736: --- I can work on this. > CreateMap allows non-unique keys > > > Key: SPARK-18736 > URL: https://issues.apache.org/jira/browse/SPARK-18736 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eyal Farago > Labels: map, sql, types > > Spark-Sql, {{CreateMap}} does not enforce unique keys, i.e. it's possible to > create a map with two identical keys: > {noformat} > CreateMap(Literal(1), Literal(11), Literal(1), Literal(12)) > {noformat} > This does not behave like standard maps in common programming languages. > proper behavior should be chosen: > # first 'wins' > # last 'wins' > # runtime error. > {{GetMapValue}} currently implements option #1. Even if this is the desired > behavior {{CreateMap}} should return a unique map. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18652) Include the example data and third-party licenses in pyspark package
[ https://issues.apache.org/jira/browse/SPARK-18652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated SPARK-18652: -- Description: Since we already include the python examples in the pyspark package, we should include the example data with it as well. We should also include the third-party licenses since we distribute their jars with the pyspark package. was:Since we already include the python examples in the pyspark package, we should include the example data with it as well. > Include the example data and third-party licenses in pyspark package > > > Key: SPARK-18652 > URL: https://issues.apache.org/jira/browse/SPARK-18652 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Shuai Lin >Priority: Minor > > Since we already include the python examples in the pyspark package, we > should include the example data with it as well. > We should also include the third-party licenses since we distribute their > jars with the pyspark package.
[jira] [Updated] (SPARK-18652) Include the example data and third-party licenses in pyspark package
[ https://issues.apache.org/jira/browse/SPARK-18652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated SPARK-18652: -- Summary: Include the example data and third-party licenses in pyspark package (was: Include the example data with the pyspark package) > Include the example data and third-party licenses in pyspark package > > > Key: SPARK-18652 > URL: https://issues.apache.org/jira/browse/SPARK-18652 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Shuai Lin >Priority: Minor > > Since we already include the python examples in the pyspark package, we > should include the example data with it as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18652) Include the example data with the pyspark package
Shuai Lin created SPARK-18652: - Summary: Include the example data with the pyspark package Key: SPARK-18652 URL: https://issues.apache.org/jira/browse/SPARK-18652 Project: Spark Issue Type: Sub-task Components: PySpark Reporter: Shuai Lin Priority: Minor Since we already include the python examples in the pyspark package, we should include the example data with it as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used
[ https://issues.apache.org/jira/browse/SPARK-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated SPARK-18171: -- Description: In [[SPARK-4563]] we added the support for the driver to advertise a different hostname/ip ({{spark.driver.host}}) to the executors other than the hostname/ip the driver actually binds to ({{spark.driver.bindAddress}}). But in the mesos webui's frameworks page, it still shows the driver's bind hostname/ip (though the web ui link is correct). We should fix it to make them consistent. (was: In INF-4563 we added the support for the driver to advertise a different hostname/ip ({{spark.driver.host}} to the executors other than the hostname/ip the driver actually binds to ({{spark.driver.bindAddress}}). But in the mesos webui's frameworks page, it still shows the driver's binds hostname/ip (though the web ui link is correct). We should fix it to make them consistent.) > Show correct framework address in mesos master web ui when the advertised > address is used > - > > Key: SPARK-18171 > URL: https://issues.apache.org/jira/browse/SPARK-18171 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Shuai Lin >Priority: Minor > > In [[SPARK-4563]] we added the support for the driver to advertise a > different hostname/ip ({{spark.driver.host}}) to the executors other than the > hostname/ip the driver actually binds to ({{spark.driver.bindAddress}}). But > in the mesos webui's frameworks page, it still shows the driver's bind > hostname/ip (though the web ui link is correct). We should fix it to make > them consistent.
[jira] [Created] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used
Shuai Lin created SPARK-18171: - Summary: Show correct framework address in mesos master web ui when the advertised address is used Key: SPARK-18171 URL: https://issues.apache.org/jira/browse/SPARK-18171 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Shuai Lin Priority: Minor In SPARK-4563 we added the support for the driver to advertise a different hostname/ip ({{spark.driver.host}}) to the executors other than the hostname/ip the driver actually binds to ({{spark.driver.bindAddress}}). But in the mesos webui's frameworks page, it still shows the driver's bind hostname/ip (though the web ui link is correct). We should fix it to make them consistent.
[jira] [Commented] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip
[ https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15618022#comment-15618022 ] Shuai Lin commented on SPARK-4563: -- To do that I think we need to add two extra options: {{spark.driver.advertisePort}} and {{spark.driver.blockManager.advertisePort}}, and pass them to the executors (instead of {{spark.driver.port}} and {{spark.driver.blockManager.port}}) when present. > Allow spark driver to bind to different ip then advertise ip > > > Key: SPARK-4563 > URL: https://issues.apache.org/jira/browse/SPARK-4563 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Long Nguyen >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.1.0 > > > The Spark driver bind ip and advertise ip are not configurable. spark.driver.host is > only the bind ip. SPARK_PUBLIC_DNS does not work for the spark driver. Allow an option > to set the advertised ip/hostname
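A minimal sketch of the fallback logic this proposal implies — note that {{spark.driver.advertisePort}} is only the option name proposed in the comment above, not an existing Spark setting:

```scala
// Sketch only: resolve the port to advertise to executors, preferring the
// proposed spark.driver.advertisePort and falling back to the bind port.
object DriverConf {
  def advertisedPort(conf: Map[String, String]): Option[Int] =
    conf.get("spark.driver.advertisePort")   // proposed option, not an existing one
      .orElse(conf.get("spark.driver.port")) // the actual bind port
      .map(_.toInt)
}
```

The same pattern would apply to {{spark.driver.blockManager.advertisePort}} versus {{spark.driver.blockManager.port}}.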
[jira] [Created] (SPARK-17940) Typo in LAST function error message
Shuai Lin created SPARK-17940: - Summary: Typo in LAST function error message Key: SPARK-17940 URL: https://issues.apache.org/jira/browse/SPARK-17940 Project: Spark Issue Type: Improvement Components: SQL Reporter: Shuai Lin Priority: Minor https://github.com/apache/spark/blob/v2.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala#L40 {code} throw new AnalysisException("The second argument of First should be a boolean literal.") {code} "First" should be "Last". Also the usage string can be improved to match the FIRST function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17802) Lots of "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" In spark logs
[ https://issues.apache.org/jira/browse/SPARK-17802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated SPARK-17802: -- Description: SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the {{CallerContext}} is only added since [hadoop 2.8|https://issues.apache.org/jira/browse/HDFS-9184], which is not officially released yet. So each time {{utils.CallerContext.setCurrentContext()}} is called (e.g [when a task is created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" error is logged, which pollutes the spark logs when there are lots of tasks. We should improve this so it's only logged once. was: SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the {{CallerContext}} is only added since [hadoop 2.8|https://issues.apache.org/jira/browse/HDFS-9184], which is not even officially released yet. So each time {{utils.CallerContext.setCurrentContext()}} is called (e.g [when a task is created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" error is logged, which pollutes the spark logs when there are lots of tasks. We should improve this so it's only logged once. 
> Lots of "java.lang.ClassNotFoundException: > org.apache.hadoop.ipc.CallerContext" In spark logs > - > > Key: SPARK-17802 > URL: https://issues.apache.org/jira/browse/SPARK-17802 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shuai Lin >Priority: Minor > > SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis > to make spark applications more diagnosable in hadoop/hdfs logs. However, the > {{CallerContext}} is only added since [hadoop > 2.8|https://issues.apache.org/jira/browse/HDFS-9184], which is not officially > released yet. So each time {{utils.CallerContext.setCurrentContext()}} is > called (e.g [when a task is > created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]), > a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" > error is logged, which pollutes the spark logs when there are lots of tasks. > We should improve this so it's only logged once.
[jira] [Updated] (SPARK-17802) Lots of "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" In spark logs
[ https://issues.apache.org/jira/browse/SPARK-17802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated SPARK-17802: -- Description: SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the {{CallerContext}} is only added since [hadoop 2.8|https://issues.apache.org/jira/browse/HDFS-9184], which is not even officially released yet. So each time {{utils.CallerContext.setCurrentContext()}} is called (e.g [when a task is created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" error is logged, which pollutes the spark logs when there are lots of tasks. We should improve this so it's only logged once. was: SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the {{CallerContext}} is only added since [hadoop 2.8|https://issues.apache.org/jira/browse/HDFS-9184]. So each time {{utils.CallerContext.setCurrentContext()}} is called (e.g [when a task is created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" error is logged, which pollutes the spark logs when there are lots of tasks. We should improve this so it's only logged once. > Lots of "java.lang.ClassNotFoundException: > org.apache.hadoop.ipc.CallerContext" In spark logs > - > > Key: SPARK-17802 > URL: https://issues.apache.org/jira/browse/SPARK-17802 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shuai Lin >Priority: Minor > > SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis > to make spark applications more diagnosable in hadoop/hdfs logs. 
However, the > {{CallerContext}} is only added since [hadoop > 2.8|https://issues.apache.org/jira/browse/HDFS-9184], which is not even > officially released yet. So each time > {{utils.CallerContext.setCurrentContext()}} is called (e.g [when a task is > created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]), > a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" > error is logged, which pollutes the spark logs when there are lots of tasks. > We should improve this so it's only logged once.
[jira] [Updated] (SPARK-17802) Lots of "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" In spark logs
[ https://issues.apache.org/jira/browse/SPARK-17802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated SPARK-17802: -- Description: SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the {{CallerContext}} is only added since [hadoop 2.8|https://issues.apache.org/jira/browse/HDFS-9184]. So each time {{utils.CallerContext.setCurrentContext()}} is called (e.g [when a task is created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" error is logged, which pollutes the spark logs when there are lots of tasks. We should improve this so it's only logged once. was: SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the {{CallerContext}} is only added since hadoop 2.8. So each time {{utils.CallerContext.setCurrentContext()}} is called (e.g when a task is created), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" error is logged. We should improve this so it's only logged once. > Lots of "java.lang.ClassNotFoundException: > org.apache.hadoop.ipc.CallerContext" In spark logs > - > > Key: SPARK-17802 > URL: https://issues.apache.org/jira/browse/SPARK-17802 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shuai Lin >Priority: Minor > > SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis > to make spark applications more diagnosable in hadoop/hdfs logs. However, the > {{CallerContext}} is only added since [hadoop > 2.8|https://issues.apache.org/jira/browse/HDFS-9184]. 
So each time > {{utils.CallerContext.setCurrentContext()}} is called (e.g [when a task is > created|https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96]), > a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" > error is logged, which pollutes the spark logs when there are lots of tasks. > We should improve this so it's only logged once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17802) Lots of "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" In spark logs
Shuai Lin created SPARK-17802: - Summary: Lots of "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" In spark logs Key: SPARK-17802 URL: https://issues.apache.org/jira/browse/SPARK-17802 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shuai Lin Priority: Minor SPARK-16757 sets the hadoop {{CallerContext}} when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the {{CallerContext}} is only added since hadoop 2.8. So each time {{utils.CallerContext.setCurrentContext()}} is called (e.g when a task is created), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" error is logged. We should improve this so it's only logged once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17489) Improve filtering for bucketed tables
Shuai Lin created SPARK-17489: - Summary: Improve filtering for bucketed tables Key: SPARK-17489 URL: https://issues.apache.org/jira/browse/SPARK-17489 Project: Spark Issue Type: Improvement Components: SQL Reporter: Shuai Lin Datasource allows creation of bucketed tables; can we optimize the query planning when there is a filter on the bucketed column? For example: {code} select * from bucketed_table where bucketed_col = "foo" {code} Given the above query, Spark should only load the bucket files corresponding to the value "foo". But the current implementation loads all the files. Here is a small program to demonstrate. {code} # bin/spark-shell --master="local[2]" case class Foo(name: String, age: Int) spark.createDataFrame(Seq( Foo("aaa", 1), Foo("aaa", 2), Foo("bbb", 3), Foo("bbb", 4))) .write .format("json") .mode("overwrite") .bucketBy(2, "name") .saveAsTable("foo") spark.sql("select * from foo where name = 'aaa'").show() {code} Then use sysdig to capture the file read events: {code} $ sudo sysdig -A -p "*%evt.time %evt.buffer" "fd.name contains spark-warehouse" and "evt.buffer contains bbb" 05:36:59.430426611 {\"name\":\"bbb\",\"age\":3} {\"name\":\"bbb\",\"age\":4} {code} Sysdig shows that bucket files which obviously don't match the filter (name = "aaa") are also read by Spark.
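The optimization requested here boils down to hashing the filter value to a bucket id and reading only the matching bucket files. The sketch below is illustrative only: it uses {{String.hashCode}} as a stand-in, while Spark's real bucketing hash is Murmur3-based, so these ids will not match Spark's on-disk layout.

```scala
// Illustrative bucket pruning: map a filter value to its bucket id and
// keep only the files for that bucket.
object BucketPruning {
  // Stand-in hash; Spark's bucketing uses a Murmur3-based hash instead.
  def bucketFor(value: String, numBuckets: Int): Int = {
    val h = value.hashCode % numBuckets
    if (h < 0) h + numBuckets else h // normalize negative remainders
  }

  // files: (bucketId, path) pairs; return only the paths worth scanning
  // for an equality filter on the bucketed column.
  def prune(files: Seq[(Int, String)], value: String, numBuckets: Int): Seq[String] = {
    val b = bucketFor(value, numBuckets)
    files.collect { case (id, path) if id == b => path }
  }
}
```

With two buckets, an equality filter would then touch only half of the bucket files instead of all of them, which is the behavior the issue asks for.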
[jira] [Commented] (SPARK-17414) Set type is not supported for creating data frames
[ https://issues.apache.org/jira/browse/SPARK-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15472302#comment-15472302 ] Shuai Lin commented on SPARK-17414: --- So what type should {{Set}} be mapped to? {{ArrayType}}? That sounds sort of counter-intuitive. > Set type is not supported for creating data frames > -- > > Key: SPARK-17414 > URL: https://issues.apache.org/jira/browse/SPARK-17414 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Emre Colak >Priority: Minor > > For a case class that has a field of type Set, createDataFrame() method > throws an exception saying "Schema for type Set is not supported". Exception > is raised by the org.apache.spark.sql.catalyst.ScalaReflection class where > Array, Seq and Map types are supported but Set is not. It would be nice to > support Set here by default instead of having to write a custom Encoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated SPARK-16975: -- Labels: parquet (was: ) > Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2 > -- > > Key: SPARK-16975 > URL: https://issues.apache.org/jira/browse/SPARK-16975 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Ubuntu Linux 14.04 >Reporter: immerrr again > Labels: parquet > > Spark-2.0.0 seems to have some problems reading a parquet dataset generated > by 1.6.2. > {code} > In [80]: spark.read.parquet('/path/to/data') > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data. It must be specified manually;' > {code} > The dataset is ~150G and partitioned by _locality_code column. None of the > partitions are empty. I have narrowed the failing dataset to the first 32 > partitions of the data: > {code} > In [82]: spark.read.parquet(*subdirs[:32]) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be > specified manually;' > {code} > Interestingly, it works OK if you remove any of the partitions from the list: > {code} > In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + > subdirs[i+1:32])) > {code} > Another strange thing is that the schemas for the first and the last 31 > partitions of the subset are identical: > {code} > In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == > spark.read.parquet(*subdirs[1:32]).schema.fields > Out[84]: True > {code} > Which got me interested and I tried this: > {code} > In [87]: spark.read.parquet(*([subdirs[0]] * 32)) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be > specified manually;' > In [88]: spark.read.parquet(*([subdirs[15]] * 32)) > ... 
> AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be > specified manually;' > In [89]: spark.read.parquet(*([subdirs[31]] * 32)) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be > specified manually;' > {code} > If I read the first partition, save it in 2.0 and try to read in the same > manner, everything is fine: > {code} > In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test') > 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl > In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32)) > {code} > I have originally posted it to user mailing list, but with the last > discoveries this clearly seems like a bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16822) Support latex in scaladoc with MathJax
[ https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401145#comment-15401145 ] Shuai Lin commented on SPARK-16822: --- I'm working on it and will post a PR soon. > Support latex in scaladoc with MathJax > -- > > Key: SPARK-16822 > URL: https://issues.apache.org/jira/browse/SPARK-16822 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Shuai Lin >Priority: Minor > > The scaladoc of some classes (mainly ml/mllib classes) include math formulas, > but currently it renders very ugly, e.g. [the doc of the LogisticGradient > class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient]. > We can improve this by including MathJax javascripts in the scaladocs page, > much like what we do for the markdown docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16822) Support latex in scaladoc with MathJax
Shuai Lin created SPARK-16822:
---------------------------------

             Summary: Support latex in scaladoc with MathJax
                 Key: SPARK-16822
                 URL: https://issues.apache.org/jira/browse/SPARK-16822
             Project: Spark
          Issue Type: Improvement
          Components: Documentation
            Reporter: Shuai Lin
            Priority: Minor


The scaladoc of some classes (mainly ml/mllib classes) includes math formulas,
but they currently render poorly, e.g. [the doc of the LogisticGradient
class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient].
We can improve this by including the MathJax javascript in the scaladoc pages,
much like what we do for the markdown docs.
[jira] [Commented] (SPARK-16485) Additional fixes to Mllib 2.0 documentation
[ https://issues.apache.org/jira/browse/SPARK-16485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15376291#comment-15376291 ]

Shuai Lin commented on SPARK-16485:
-----------------------------------

[~josephkb] I fixed the math formatting of the {{MinMaxScaler}} doc in
{{ml-features.md}} in the above PR, please take a look.

> Additional fixes to Mllib 2.0 documentation
> -------------------------------------------
>
>                 Key: SPARK-16485
>                 URL: https://issues.apache.org/jira/browse/SPARK-16485
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation, GraphX, ML, MLlib, SparkR
>            Reporter: Timothy Hunter
>            Assignee: Joseph K. Bradley
>             Fix For: 2.0.1, 2.1.0
>
> While reviewing the documentation of MLlib, I found some additional issues.
> Important issues that affect the binary signatures:
> - GBTClassificationModel: all the setters should be overridden
> - LogisticRegressionModel: setThreshold(s)
> - RandomForestClassificationModel: all the setters should be overridden
> - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but
>   most of the methods are private[ml] -> do we need to expose this class for
>   now?
> - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should
>   not be exposed
> - sqlDataTypes: name does not follow conventions. Do we need to expose it?
> Issues that involve only documentation:
> - Evaluator:
>   1. inconsistent doc between evaluate and isLargerBetter
> - MinMaxScaler: math rendering
> - GeneralizedLinearRegressionSummary: aic doc is incorrect
> The reference documentation that was used was:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/
[jira] [Commented] (SPARK-16485) Additional fixes to Mllib 2.0 documentation
[ https://issues.apache.org/jira/browse/SPARK-16485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374558#comment-15374558 ]

Shuai Lin commented on SPARK-16485:
-----------------------------------

[~timhunter] [~josephkb] I'm new to the Spark community; may I create a
sub-task for the doc-related changes mentioned in the description and work on
it?

> Additional fixes to Mllib 2.0 documentation
> -------------------------------------------
>
>                 Key: SPARK-16485
>                 URL: https://issues.apache.org/jira/browse/SPARK-16485
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation, GraphX, ML, MLlib, SparkR
>            Reporter: Timothy Hunter
>
> While reviewing the documentation of MLlib, I found some additional issues.
> Important issues that affect the binary signatures:
> - GBTClassificationModel: all the setters should be overridden
> - LogisticRegressionModel: setThreshold(s)
> - RandomForestClassificationModel: all the setters should be overridden
> - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but
>   most of the methods are private[ml] -> do we need to expose this class for
>   now?
> - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should
>   not be exposed
> - sqlDataTypes: name does not follow conventions. Do we need to expose it?
> Issues that involve only documentation:
> - Evaluator:
>   1. inconsistent doc between evaluate and isLargerBetter
> - MinMaxScaler: math rendering
> - GeneralizedLinearRegressionSummary: aic doc is incorrect
> The reference documentation that was used was:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/
[jira] [Updated] (SPARK-16490) Python mllib example for chi-squared feature selector
[ https://issues.apache.org/jira/browse/SPARK-16490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shuai Lin updated SPARK-16490:
------------------------------
    Labels: starter  (was: )

> Python mllib example for chi-squared feature selector
> -----------------------------------------------------
>
>                 Key: SPARK-16490
>                 URL: https://issues.apache.org/jira/browse/SPARK-16490
>             Project: Spark
>          Issue Type: Task
>          Components: MLlib, PySpark
>            Reporter: Shuai Lin
>            Priority: Minor
>              Labels: starter
>
> There are Java & Scala examples for {{ChiSqSelector}} in mllib, but the
> corresponding Python example is missing.
[jira] [Created] (SPARK-16490) Python mllib example for chi-squared feature selector
Shuai Lin created SPARK-16490:
---------------------------------

             Summary: Python mllib example for chi-squared feature selector
                 Key: SPARK-16490
                 URL: https://issues.apache.org/jira/browse/SPARK-16490
             Project: Spark
          Issue Type: Task
          Components: MLlib, PySpark
            Reporter: Shuai Lin
            Priority: Minor


There are Java & Scala examples for {{ChiSqSelector}} in mllib, but the
corresponding Python example is missing.
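The missing Python example the ticket asks for would use MLlib's {{ChiSqSelector}} itself. Purely to illustrate the selection criterion behind it, here is a plain-Python sketch (not the pyspark API; all function names here are made up) that scores each binary feature's chi-squared statistic of independence against a binary label and keeps the top k features:

```python
# Hypothetical sketch of chi-squared feature selection, NOT the MLlib API.
from collections import Counter

def chi2_score(xs, ys):
    """Chi-squared statistic of independence between one feature column
    and the label column (both given as equal-length lists)."""
    n = len(xs)
    obs = Counter(zip(xs, ys))          # observed counts per (feature, label) cell
    x_marg = Counter(xs)                # feature-value marginals
    y_marg = Counter(ys)                # label-value marginals
    score = 0.0
    for x in x_marg:
        for y in y_marg:
            expected = x_marg[x] * y_marg[y] / n   # count expected under independence
            score += (obs[(x, y)] - expected) ** 2 / expected
    return score

def select_top_k(rows, labels, k):
    """Return the indices of the k features with the highest chi2 score."""
    n_features = len(rows[0])
    scores = [chi2_score([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

# Feature 0 perfectly tracks the label; feature 1 is uncorrelated noise.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [1, 1, 0, 0]
print(select_top_k(rows, labels, 1))  # -> [0]
```

A real example for the docs would build a {{ChiSqSelector}} against an RDD of labeled points; this sketch only shows what the selector's ranking computes.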
[jira] [Commented] (SPARK-11938) Expose numFeatures in all ML PredictionModel for PySpark
[ https://issues.apache.org/jira/browse/SPARK-11938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366000#comment-15366000 ]

Shuai Lin commented on SPARK-11938:
-----------------------------------

This ticket seems to conflict with [SPARK-15113].

> Expose numFeatures in all ML PredictionModel for PySpark
> --------------------------------------------------------
>
>                 Key: SPARK-11938
>                 URL: https://issues.apache.org/jira/browse/SPARK-11938
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, PySpark
>            Reporter: Yanbo Liang
>            Assignee: Kai Sasaki
>            Priority: Minor
>
> SPARK-9715 provided support for numFeatures in all ML PredictionModels; we
> should expose it on the Python side.
[jira] [Commented] (SPARK-15009) PySpark CountVectorizerModel should be able to construct from vocabulary list
[ https://issues.apache.org/jira/browse/SPARK-15009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360714#comment-15360714 ]

Shuai Lin commented on SPARK-15009:
-----------------------------------

Hi [~bryanc], what's the status of this ticket? I can work on it if you're ok
with that.

> PySpark CountVectorizerModel should be able to construct from vocabulary list
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-15009
>                 URL: https://issues.apache.org/jira/browse/SPARK-15009
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>            Reporter: Bryan Cutler
>            Priority: Minor
>
> Like the Scala version, PySpark CountVectorizerModel should be able to
> construct the model from a given vocabulary list.
> For example:
> {noformat}
> cvm = CountVectorizerModel(["a", "b", "c"])
> {noformat}
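To make the requested behavior concrete, here is a plain-Python sketch of the semantics of constructing a count-vectorizer model directly from a fixed vocabulary, with no fitting step. This is not the pyspark API; the class name is made up for illustration:

```python
# Hypothetical stand-in for the requested constructor, NOT the pyspark API.
from collections import Counter

class TinyCountVectorizerModel:
    def __init__(self, vocabulary):
        # The caller supplies the vocabulary up front, so no fit() is needed.
        self.vocabulary = list(vocabulary)
        self._index = {term: i for i, term in enumerate(self.vocabulary)}

    def transform(self, tokens):
        """Map a token list to a list of term counts over the fixed
        vocabulary; out-of-vocabulary tokens are ignored."""
        counts = Counter(t for t in tokens if t in self._index)
        return [counts[term] for term in self.vocabulary]

cvm = TinyCountVectorizerModel(["a", "b", "c"])
print(cvm.transform(["a", "b", "a", "d"]))  # -> [2, 1, 0]
```

The point of the ticket is that the Scala {{CountVectorizerModel}} already supports this construction path, and the Python wrapper should expose the same entry point rather than requiring a fitted {{CountVectorizer}}.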