[jira] [Assigned] (SPARK-21278) Upgrade to Py4J 0.10.5

2017-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21278:


Assignee: Apache Spark

> Upgrade to Py4J 0.10.5
> --
>
> Key: SPARK-21278
> URL: https://issues.apache.org/jira/browse/SPARK-21278
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to bump Py4J in order to fix the following float/double bug.
> Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272).
> {code}
> >>> df = spark.range(1)
> >>> df.select(df['id'] + 17.133574204226083).show()
> +--------------------+
> |(id + 17.1335742042)|
> +--------------------+
> |       17.1335742042|
> +--------------------+
> {code}
> {code}
> >>> df = spark.range(1)
> >>> df.select(df['id'] + 17.133574204226083).show()
> +-------------------------+
> |(id + 17.133574204226083)|
> +-------------------------+
> |       17.133574204226083|
> +-------------------------+
> {code}






[jira] [Commented] (SPARK-21278) Upgrade to Py4J 0.10.5

2017-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071493#comment-16071493
 ] 

Apache Spark commented on SPARK-21278:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/18502

> Upgrade to Py4J 0.10.5
> --
>
> Key: SPARK-21278
> URL: https://issues.apache.org/jira/browse/SPARK-21278
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> This issue aims to bump Py4J in order to fix the following float/double bug.
> Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272).
> {code}
> >>> df = spark.range(1)
> >>> df.select(df['id'] + 17.133574204226083).show()
> +--------------------+
> |(id + 17.1335742042)|
> +--------------------+
> |       17.1335742042|
> +--------------------+
> {code}
> {code}
> >>> df = spark.range(1)
> >>> df.select(df['id'] + 17.133574204226083).show()
> +-------------------------+
> |(id + 17.133574204226083)|
> +-------------------------+
> |       17.133574204226083|
> +-------------------------+
> {code}






[jira] [Assigned] (SPARK-21278) Upgrade to Py4J 0.10.5

2017-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21278:


Assignee: (was: Apache Spark)

> Upgrade to Py4J 0.10.5
> --
>
> Key: SPARK-21278
> URL: https://issues.apache.org/jira/browse/SPARK-21278
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> This issue aims to bump Py4J in order to fix the following float/double bug.
> Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272).
> {code}
> >>> df = spark.range(1)
> >>> df.select(df['id'] + 17.133574204226083).show()
> +--------------------+
> |(id + 17.1335742042)|
> +--------------------+
> |       17.1335742042|
> +--------------------+
> {code}
> {code}
> >>> df = spark.range(1)
> >>> df.select(df['id'] + 17.133574204226083).show()
> +-------------------------+
> |(id + 17.133574204226083)|
> +-------------------------+
> |       17.133574204226083|
> +-------------------------+
> {code}






[jira] [Created] (SPARK-21278) Upgrade to Py4J 0.10.5

2017-07-01 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-21278:
-

 Summary: Upgrade to Py4J 0.10.5
 Key: SPARK-21278
 URL: https://issues.apache.org/jira/browse/SPARK-21278
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0, 2.2.0
Reporter: Dongjoon Hyun


This issue aims to bump Py4J in order to fix the following float/double bug.
Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272).

{code}
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+--------------------+
|(id + 17.1335742042)|
+--------------------+
|       17.1335742042|
+--------------------+
{code}

{code}
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+-------------------------+
|(id + 17.133574204226083)|
+-------------------------+
|       17.133574204226083|
+-------------------------+
{code}
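
The truncated value in the first block matches what Python 2's str() gives for a double (12 significant digits), while the second block matches the full-precision repr(); a minimal sketch of that difference, assuming (per py4j#272) that the older Py4J used the shorter str()-style representation when sending floats to the JVM:

{code}
# Python 2 semantics; in Python 3, str() and repr() agree for floats.
x = 17.133574204226083
print(str(x))   # '17.1335742042'      -- 12 significant digits, like the first output
print(repr(x))  # '17.133574204226083' -- full precision, like the second output
{code}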






[jira] [Assigned] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir

2017-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20256:


Assignee: Apache Spark

> Fail to start SparkContext/SparkSession with Hive support enabled when user 
> does not have read/write privilege to Hive metastore warehouse dir
> --
>
> Key: SPARK-20256
> URL: https://issues.apache.org/jira/browse/SPARK-20256
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Xin Wu
>Assignee: Apache Spark
>Priority: Critical
>
> In a cluster setup with production Hive running, when the user wants to run 
> spark-shell against the production Hive metastore, hive-site.xml is copied to 
> SPARK_HOME/conf. So when spark-shell starts, it checks whether the "default" 
> database exists in the Hive metastore. Since this user may not have READ/WRITE 
> access to the Hive warehouse directory configured by Hive itself, this 
> permission error prevents spark-shell, or any Spark application with Hive 
> support enabled, from starting at all. 
> Example error:
> {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> java.lang.IllegalArgumentException: Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
>   at 
> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
>   at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   ... 47 elided
> Caused by: java.lang.reflect.InvocationTargetException: 
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:java.security.AccessControlException: Permission 
> denied: user=notebook, access=READ, 
> inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
> );
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> 

[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir

2017-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071459#comment-16071459
 ] 

Apache Spark commented on SPARK-20256:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/18501

> Fail to start SparkContext/SparkSession with Hive support enabled when user 
> does not have read/write privilege to Hive metastore warehouse dir
> --
>
> Key: SPARK-20256
> URL: https://issues.apache.org/jira/browse/SPARK-20256
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Xin Wu
>Priority: Critical
>
> In a cluster setup with production Hive running, when the user wants to run 
> spark-shell against the production Hive metastore, hive-site.xml is copied to 
> SPARK_HOME/conf. So when spark-shell starts, it checks whether the "default" 
> database exists in the Hive metastore. Since this user may not have READ/WRITE 
> access to the Hive warehouse directory configured by Hive itself, this 
> permission error prevents spark-shell, or any Spark application with Hive 
> support enabled, from starting at all. 
> Example error:
> {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> java.lang.IllegalArgumentException: Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
>   at 
> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
>   at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   ... 47 elided
> Caused by: java.lang.reflect.InvocationTargetException: 
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:java.security.AccessControlException: Permission 
> denied: user=notebook, access=READ, 
> inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
> );
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 

[jira] [Assigned] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir

2017-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20256:


Assignee: (was: Apache Spark)

> Fail to start SparkContext/SparkSession with Hive support enabled when user 
> does not have read/write privilege to Hive metastore warehouse dir
> --
>
> Key: SPARK-20256
> URL: https://issues.apache.org/jira/browse/SPARK-20256
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Xin Wu
>Priority: Critical
>
> In a cluster setup with production Hive running, when the user wants to run 
> spark-shell against the production Hive metastore, hive-site.xml is copied to 
> SPARK_HOME/conf. So when spark-shell starts, it checks whether the "default" 
> database exists in the Hive metastore. Since this user may not have READ/WRITE 
> access to the Hive warehouse directory configured by Hive itself, this 
> permission error prevents spark-shell, or any Spark application with Hive 
> support enabled, from starting at all. 
> Example error:
> {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> java.lang.IllegalArgumentException: Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
>   at 
> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
>   at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   ... 47 elided
> Caused by: java.lang.reflect.InvocationTargetException: 
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:java.security.AccessControlException: Permission 
> denied: user=notebook, access=READ, 
> inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
> );
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> 

[jira] [Commented] (SPARK-21244) KMeans applied to processed text data clumps almost all documents into one cluster

2017-07-01 Thread Nassir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071447#comment-16071447
 ] 

Nassir commented on SPARK-21244:


Hi,

The pyspark k-means implementation is run on the same 20 newsgroups document set 
that sklearn k-means is run on.

The pyspark version does not produce any meaningful clusters, unlike sklearn 
k-means (both use Euclidean distance as the distance measure).

The 'bug' is that pyspark k-means applied to tf-idf documents does not produce 
the expected results. I would be interested to know if anyone has used k-means 
in Spark MLlib to cluster a standard document set such as the 20 newsgroups set. 
Do you get almost all the documents clumped into one cluster, as I do?
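
For anyone reproducing this, a minimal PySpark TF-IDF + k-means sketch of the pipeline described above (column names, numFeatures and k are illustrative; `docs` is assumed to be a DataFrame with a string column "text" holding the 20 newsgroups posts):

{code}
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.clustering import KMeans

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18)
idf = IDF(inputCol="tf", outputCol="features")
kmeans = KMeans(k=20, featuresCol="features", predictionCol="cluster")

model = Pipeline(stages=[tokenizer, tf, idf, kmeans]).fit(docs)
# Inspect cluster sizes to see whether everything lands in one cluster.
model.transform(docs).groupBy("cluster").count().show()
{code}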

> KMeans applied to processed text data clumps almost all documents into one 
> cluster
> -
>
> Key: SPARK-21244
> URL: https://issues.apache.org/jira/browse/SPARK-21244
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Nassir
>
> I have observed this problem for quite a while now regarding the 
> implementation of pyspark KMeans on text documents - clustering documents 
> according to their TF-IDF vectors. The pyspark implementation - even on 
> standard datasets - clusters almost all of the documents into one cluster. 
> I implemented K-means on the same dataset with the same parameters using the 
> sklearn library, and it clusters the documents very well. 
> I recommend that anyone who is able to do so test the pyspark implementation 
> of KMeans on text documents - it obviously has a bug in it somewhere.
> (Currently I convert my Spark dataframe to a pandas dataframe, run k-means, 
> and convert back. However, this is of course not a parallel solution capable 
> of handling huge amounts of data in the future.)
> Here is a link to the question I posted a while back on Stack Overflow: 
> https://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one






[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir

2017-07-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071414#comment-16071414
 ] 

Dongjoon Hyun commented on SPARK-20256:
---

Hi, [~xwu0226].
Are you still preparing a unit test case?

> Fail to start SparkContext/SparkSession with Hive support enabled when user 
> does not have read/write privilege to Hive metastore warehouse dir
> --
>
> Key: SPARK-20256
> URL: https://issues.apache.org/jira/browse/SPARK-20256
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Xin Wu
>Priority: Critical
>
> In a cluster setup with production Hive running, when the user wants to run 
> spark-shell against the production Hive metastore, hive-site.xml is copied to 
> SPARK_HOME/conf. So when spark-shell starts, it checks whether the "default" 
> database exists in the Hive metastore. Since this user may not have READ/WRITE 
> access to the Hive warehouse directory configured by Hive itself, this 
> permission error prevents spark-shell, or any Spark application with Hive 
> support enabled, from starting at all. 
> Example error:
> {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> java.lang.IllegalArgumentException: Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
>   at 
> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
>   at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   ... 47 elided
> Caused by: java.lang.reflect.InvocationTargetException: 
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:java.security.AccessControlException: Permission 
> denied: user=notebook, access=READ, 
> inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
> );
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> 
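
For reference, the failure described above is hit as soon as any Hive-enabled session is created; a minimal PySpark sketch of that entry point (the warehouse path and permissions are environment-specific):

{code}
from pyspark.sql import SparkSession

# enableHiveSupport() triggers the HiveSessionState instantiation shown in the
# stack trace above; with an unreadable warehouse dir (e.g. /apps/hive/warehouse
# owned by hive:hadoop, mode drwxrwx---) startup fails with AccessControlException.
spark = (SparkSession.builder
         .appName("hive-metastore-check")
         .enableHiveSupport()
         .getOrCreate())
{code}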

[jira] [Updated] (SPARK-21267) Improvements to the Structured Streaming programming guide

2017-07-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-21267:
-
Target Version/s:   (was: 2.2.0)

> Improvements to the Structured Streaming programming guide
> --
>
> Key: SPARK-21267
> URL: https://issues.apache.org/jira/browse/SPARK-21267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> - Add information for Ganglia
> - Add Kafka Sink to the main docs
> - Move Structured Streaming above Spark Streaming






[jira] [Commented] (SPARK-15526) Shade JPMML

2017-07-01 Thread Villu Ruusmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071392#comment-16071392
 ] 

Villu Ruusmann commented on SPARK-15526:


This issue is responsible for >95% of support requests around my JPMML-SparkML 
and JPMML-SparkML-Package projects.

I've given up teaching people how to use the Apache Maven Shade plugin to work 
around it. For Apache Spark 2.0 and 2.1 users, the "official fix" is to delete 
the two offending JAR files `$SPARK_HOME/jars/pmml-model-1.2.15.jar` and 
`$SPARK_HOME/jars/pmml-schema-1.2.15.jar`. Sure, it will break the built-in MLlib 
PMML export functionality, but anyone who is interested in using the 
JPMML-SparkML replacements shouldn't mind.

> Shade JPMML
> ---
>
> Key: SPARK-15526
> URL: https://issues.apache.org/jira/browse/SPARK-15526
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Villu Ruusmann
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The Spark-MLlib module depends on the JPMML-Model library 
> (org.jpmml:pmml-model:1.2.7) for its PMML export capabilities. The 
> JPMML-Model library is included in the Apache Spark assembly, which makes it 
> very difficult to build and deploy competing PMML exporters that may wish to 
> depend on different versions (typically much newer) of the same library.
> JPMML-Model library classes are not part of Apache Spark public APIs, so it 
> shouldn't be a problem if they are relocated by prepending a prefix 
> "org.spark_project" to their package names using Maven Shade Plugin. The 
> requested treatment is identical to how Google Guava and Jetty dependencies 
> are shaded in the final assembly.
> This issue is raised in relation to the JPMML-SparkML project 
> (https://github.com/jpmml/jpmml-sparkml), which provides PMML export 
> capabilities for Spark ML Pipelines. Currently, application developers who 
> wish to use it must tweak their application classpath, which assumes 
> familiarity with build internals.






[jira] [Commented] (SPARK-15533) Deprecate Dataset.explode

2017-07-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071325#comment-16071325
 ] 

Reynold Xin commented on SPARK-15533:
-

Just use a star.
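
Concretely, a minimal PySpark sketch of the "star" approach (hypothetical DataFrame {{df}} with many columns and one array column {{items}}):

{code}
from pyspark.sql.functions import explode, col

# "*" keeps every other column without naming it, so wide Hive tables need no column list.
flat = df.select("*", explode(col("items")).alias("item"))
{code}

If the values have to come from a custom lambda, a UDF can produce the {{items}} array column first and the same select applies.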

On Sat, Jul 1, 2017 at 9:33 AM Sagara Paranagama (JIRA) 



> Deprecate Dataset.explode
> -
>
> Key: SPARK-15533
> URL: https://issues.apache.org/jira/browse/SPARK-15533
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> See discussion on the mailing list: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201605.mbox/browser
> We should deprecate Dataset.explode, and point users to Dataset.flatMap and 
> functions.explode with select.






[jira] [Created] (SPARK-21277) Spark is invoking an incorrect serializer after UDAF completion

2017-07-01 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-21277:
--

 Summary: Spark is invoking an incorrect serializer after UDAF 
completion
 Key: SPARK-21277
 URL: https://issues.apache.org/jira/browse/SPARK-21277
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 2.1.0
Reporter: Erik Erlandson


I'm writing a UDAF that also requires some custom UDT implementations. The 
UDAF (and UDT) logic appears to execute properly up through the final UDAF 
call to the {{evaluate}} method. However, after {{evaluate}} completes, the 
UDT {{deserialize}} method is called one more time, this time on data that 
wasn't produced by my corresponding {{serialize}} method, and it crashes. The 
following REPL output shows the execution and completion of {{evaluate}}, and 
then another call to {{deserialize}} that sees some kind of 
{{UnsafeArrayData}} object that my serialization doesn't produce, and so the 
method fails:

{code}entering evaluate
a= 
[[0.5,10,2,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1813f2c,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@b3587fc7],[0.5,10,4,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d3065487,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1a5ace9],[0.5,10,4,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@d01fbbcf,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@f1a5ace9]]
leaving evaluate
a= org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@27d73513
java.lang.RuntimeException: Error while decoding: 
java.lang.UnsupportedOperationException: Not supported on UnsafeArrayData.
createexternalrow(newInstance(class 
org.apache.spark.isarnproject.sketches.udt.TDigestArrayUDT).deserialize, 
StructField(tdigestmlvecudaf(features),TDigestArrayUDT,true))
{code}

To reproduce, check out the branch {{first-cut}} of {{isarn-sketches-spark}}:
https://github.com/erikerlandson/isarn-sketches-spark/tree/first-cut

Then invoke {{xsbt console}} to get a REPL with a spark session.  In the REPL 
execute:
{code}
Welcome to Scala 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131).
Type in expressions for evaluation. Or try :help.

scala> val training = spark.createDataFrame(Seq((1.0, Vectors.dense(0.0, 1.1, 
0.1)), (0.0, Vectors.dense(2.0, 1.0, -1.0)), (0.0, Vectors.dense(2.0, 1.3, 
1.0)), (1.0, Vectors.dense(0.0, 1.2, -0.5)))).toDF("label", "features")
training: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val featTD = training.agg(TDigestMLVecUDAF(0.5,10)(training("features")))
featTD: org.apache.spark.sql.DataFrame = [tdigestmlvecudaf(features): 
tdigestarray]

scala> featTD.first
{code}






[jira] [Commented] (SPARK-15533) Deprecate Dataset.explode

2017-07-01 Thread Sagara Paranagama (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071297#comment-16071297
 ] 

Sagara Paranagama commented on SPARK-15533:
---

But what if my dataframe contained over 50 columns? In my case, I'm dealing 
with a Hive table which has a lot of columns. Is there a way to explode the one 
column in my dataframe without referring to all the other columns? Also, I'm 
exploding using a custom lambda function, not with Spark built-in ones.

> Deprecate Dataset.explode
> -
>
> Key: SPARK-15533
> URL: https://issues.apache.org/jira/browse/SPARK-15533
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> See discussion on the mailing list: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201605.mbox/browser
> We should deprecate Dataset.explode, and point users to Dataset.flatMap and 
> functions.explode with select.






[jira] [Resolved] (SPARK-21170) Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted

2017-07-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21170.
---
   Resolution: Fixed
 Assignee: Devaraj K
Fix Version/s: 2.3.0
   2.2.1

Resolved by https://github.com/apache/spark/pull/18384

> Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: 
> Self-suppression not permitted
> ---
>
> Key: SPARK-21170
> URL: https://issues.apache.org/jira/browse/SPARK-21170
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Devaraj K
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> {code:xml}
> 17/06/20 22:49:39 ERROR Executor: Exception in task 225.0 in stage 1.0 (TID 
> 27225)
> java.lang.IllegalArgumentException: Self-suppression not permitted
> at java.lang.Throwable.addSuppressed(Throwable.java:1043)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1400)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> {code:xml}
> 17/06/20 22:52:32 INFO scheduler.TaskSetManager: Lost task 427.0 in stage 1.0 
> (TID 27427) on 192.168.1.121, executor 12: java.lang.IllegalArgumentException 
> (Self-suppression not permitted) [duplicate 1]
> 17/06/20 22:52:33 INFO scheduler.TaskSetManager: Starting task 427.1 in stage 
> 1.0 (TID 27764, 192.168.1.122, executor 106, partition 427, PROCESS_LOCAL, 
> 4625 bytes)
> 17/06/20 22:52:33 INFO scheduler.TaskSetManager: Lost task 186.0 in stage 1.0 
> (TID 27186) on 192.168.1.122, executor 106: 
> java.lang.IllegalArgumentException (Self-suppression not permitted) 
> [duplicate 2]
> 17/06/20 22:52:38 INFO scheduler.TaskSetManager: Starting task 186.1 in stage 
> 1.0 (TID 27765, 192.168.1.121, executor 9, partition 186, PROCESS_LOCAL, 4625 
> bytes)
> 17/06/20 22:52:38 WARN scheduler.TaskSetManager: Lost task 392.0 in stage 1.0 
> (TID 27392, 192.168.1.121, executor 9): java.lang.IllegalArgumentException: 
> Self-suppression not permitted
>   at java.lang.Throwable.addSuppressed(Throwable.java:1043)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1400)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here it is trying to suppress the same Throwable instance, which causes the 
> IllegalArgumentException to be thrown and masks the original exception.
> I think it should not be added to the suppressed exceptions if it is the same 
> instance.






[jira] [Commented] (SPARK-21186) PySpark with --packages fails to import library due to lack of pythonpath to .ivy2/jars/*.jar

2017-07-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071254#comment-16071254
 ] 

Jeff Zhang commented on SPARK-21186:


I think this is due to how spark-deep-learning distributes its jar and Python 
code. --packages is meant for jars, not for Python code, so Spark won't put it 
on the PYTHONPATH.

> PySpark with --packages fails to import library due to lack of pythonpath to 
> .ivy2/jars/*.jar
> -
>
> Key: SPARK-21186
> URL: https://issues.apache.org/jira/browse/SPARK-21186
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: Spark is downloaded and compiled by myself.
> Spark: 2.2.0-SNAPSHOT
> Python: Anaconda Python2 (on client and workers)
>Reporter: HanCheol Cho
>Priority: Minor
>
> I experienced an "ImportError: No module named sparkdl" exception while trying 
> to use databricks' spark-deep-learning (sparkdl) in PySpark.
> The package is included with --packages option as below.
> {code}
> $ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11
> >>> import sparkdl
> Traceback (most recent call last):
>   File "", line 1, in 
> ImportError: No module named sparkdl
> {code}
> The problem was that PySpark fails to detect this package's jar files located 
> in .ivy2/jars/ directory.
> I could circumvent this issue by manually adding this path to PYTHONPATH 
> after launching PySpark as follows.
> {code}
> $ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11
> >>> import sys, glob, os
> >>> sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), 
> >>> ".ivy2/jars/*.jar")))
> >>>
> >>> import sparkdl
> Using TensorFlow backend.
> >>> my_images = sparkdl.readImages("data/flower_photos/daisy/*.jpg")
> >>> my_images.show()
> +--------------------+--------------------+
> |            filePath|               image|
> +--------------------+--------------------+
> |hdfs://mycluster/...|[RGB,263,320,3,[B...|
> |hdfs://mycluster/...|[RGB,313,500,3,[B...|
> |hdfs://mycluster/...|[RGB,215,320,3,[B...|
> ...
> {code}
> I think it may be better to add the .ivy2/jars/ directory path to PYTHONPATH 
> when launching PySpark.






[jira] [Commented] (SPARK-21206) the window slice of Dstream is wrong

2017-07-01 Thread Fei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071229#comment-16071229
 ] 

Fei Shao commented on SPARK-21206:
--

The window size is correct. The oldRDDs and newRDDs are incorrect.

> the window slice of Dstream is wrong
> 
>
> Key: SPARK-21206
> URL: https://issues.apache.org/jira/browse/SPARK-21206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Fei Shao
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> the code is :
> val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint( "path")
> val lines = ssc.socketTextStream("IP", PORT)
> lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => {
>   println( "RDD ID IS : " + s.id)
>   s.foreach( e => println("data is " + e._1 + " :" + e._2))
>   println()
> })
> The result is wrong. 
> I checked the log, it showed:
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Time 1498383086000 ms is valid
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Window time = 2000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Slide time = 8000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Zero time = 1498383078000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = 
> [1498383085000 ms, 1498383086000 ms]
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
> [1498383077000 ms, 1498383078000 ms]
> 17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
> 1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)
> 17/06/25 17:31:26 INFO ShuffledDStream: Time 1498383078000 ms is invalid as 
> zeroTime is 1498383078000 ms , slideDuration is 1000 ms and difference is 0 ms
> 17/06/25 17:31:26 DEBUG ShuffledDStream: Time 1498383079000 ms is valid
> 17/06/25 17:31:26 DEBUG MappedDStream: Time 1498383079000 ms is valid
> The slice time is wrong.
> [BTW]: Team members,
> If it is a bug, please don't fix it. I'll try to fix it myself. Thanks :)






[jira] [Commented] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071176#comment-16071176
 ] 

Apache Spark commented on SPARK-21176:
--

User 'aosagie' has created a pull request for this issue:
https://github.com/apache/spark/pull/18499

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>Assignee: Ingo Schuster
>  Labels: network, web-ui
> Fix For: 2.1.2, 2.2.0
>
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you certainly do expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.






[jira] [Comment Edited] (SPARK-21206) the window slice of Dstream is wrong

2017-07-01 Thread Fei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071110#comment-16071110
 ] 

Fei Shao edited comment on SPARK-21206 at 7/1/17 9:33 AM:
--

[~srowen]

This issue is about the expressions for oldRDDs and newRDDs.
Here is the case I want to express:

!screenshot-2.png!


I think the code is:
=
// Get the RDDs of the reduced values in "old time steps"
val oldRDDs =
  if (windowDuration < slideDuration ) {
reducedStream.slice(previousWindow.beginTime, previousWindow.endTime)
  }
  else {
reducedStream.slice(previousWindow.beginTime,
  currentWindow.beginTime - parent.slideDuration)
  }

logDebug("# old RDDs = " + oldRDDs.size)

// Get the RDDs of the reduced values in "new time steps"
val newRDDs =
  if (windowDuration < slideDuration ) {
reducedStream.slice(currentWindow.beginTime, currentWindow.endTime)
  }
  else {
reducedStream.slice(previousWindow.endTime + parent.slideDuration, 
currentWindow.endTime)
  }

logDebug("# new RDDs = " + newRDDs.size)
==

I sent an email named "the function of countByValueAndWindow and foreachRDD in 
DStream, would you like help me understand it please?".
It is related to this issue.



was (Author: robin shao):
[~srowen]

Here is the case I want to express:

!screenshot-2.png!

I sent an email named "the function of countByValueAndWindow and foreachRDD in 
DStream, would you like help me understand it please?".
It is related to this issue.


> the window slice of Dstream is wrong
> 
>
> Key: SPARK-21206
> URL: https://issues.apache.org/jira/browse/SPARK-21206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Fei Shao
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> the code is :
> val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint( "path")
> val lines = ssc.socketTextStream("IP", PORT)
> lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => {
>   println( "RDD ID IS : " + s.id)
>   s.foreach( e => println("data is " + e._1 + " :" + e._2))
>   println()
> })
> The result is wrong. 
> I checked the log, it showed:
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Time 1498383086000 ms is valid
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Window time = 2000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Slide time = 8000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Zero time = 1498383078000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = 
> [1498383085000 ms, 1498383086000 ms]
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
> [1498383077000 ms, 1498383078000 ms]
> 17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
> 1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)
> 17/06/25 17:31:26 INFO ShuffledDStream: Time 1498383078000 ms is invalid as 
> zeroTime is 1498383078000 ms , slideDuration is 1000 ms and difference is 0 ms
> 17/06/25 17:31:26 DEBUG ShuffledDStream: Time 1498383079000 ms is valid
> 17/06/25 17:31:26 DEBUG MappedDStream: Time 1498383079000 ms is valid
> The slice time is wrong.
> [BTW]: Team members,
> If it is a bug, please don't fix it. I'll try to fix it myself. Thanks :)






[jira] [Resolved] (SPARK-21244) KMeans applied to processed text data clumps almost all documents into one cluster

2017-07-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21244.
---
Resolution: Not A Problem

> KMeans applied to processed text data clumps almost all documents into one 
> cluster
> -
>
> Key: SPARK-21244
> URL: https://issues.apache.org/jira/browse/SPARK-21244
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Nassir
>
> I have observed this problem for quite a while now regarding the 
> implementation of pyspark KMeans on text documents - clustering documents 
> according to their TF-IDF vectors. The pyspark implementation - even on 
> standard datasets - clusters almost all of the documents into one cluster. 
> I implemented K-means on the same dataset with the same parameters using the 
> sklearn library, and it clusters the documents very well. 
> I recommend that anyone who is able to do so test the pyspark implementation 
> of KMeans on text documents - it obviously has a bug in it somewhere.
> (Currently I convert my Spark dataframe to a pandas dataframe, run k-means, 
> and convert back. However, this is of course not a parallel solution capable 
> of handling huge amounts of data in the future.)
> Here is a link to the question I posted a while back on Stack Overflow: 
> https://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one






[jira] [Resolved] (SPARK-21233) Support pluggable offset storage

2017-07-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21233.
---
Resolution: Not A Problem

> Support pluggable offset storage
> 
>
> Key: SPARK-21233
> URL: https://issues.apache.org/jira/browse/SPARK-21233
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently we use *ZooKeeper* to save the *Kafka commit offsets*. When there 
> are a lot of streaming programs running in the cluster, the load on the 
> ZooKeeper cluster is very high. Maybe ZooKeeper is not very suitable for 
> saving offsets periodically.
> This issue proposes supporting a pluggable offset storage to avoid saving 
> them in ZooKeeper.






[jira] [Commented] (SPARK-21266) Support schema a DDL-formatted string in dapply/gapply/from_json

2017-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071117#comment-16071117
 ] 

Apache Spark commented on SPARK-21266:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18498

> Support schema a DDL-formatted string in dapply/gapply/from_json
> 
>
> Key: SPARK-21266
> URL: https://issues.apache.org/jira/browse/SPARK-21266
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> A DDL-formatted string is now supported in the schema API in the dataframe 
> reader/writer across the other language APIs. 
> {{from_json}} in R/Python does not appear to support this.
> Also, it could be supported in other commonly used APIs, in R specifically 
> {{dapply}}/{{gapply}}.
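
For context, a minimal PySpark sketch of the gap (assumes an active {{spark}} session; the DDL-string form is the behavior proposed here, not something the current Python API accepts):

{code}
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

df = spark.createDataFrame([('{"a": 1, "b": "x"}',)], ["json"])

# Today an explicit StructType is required:
schema = StructType([StructField("a", IntegerType()), StructField("b", StringType())])
df.select(from_json(col("json"), schema).alias("parsed")).show()

# Proposed: also accept the equivalent DDL-formatted string, e.g.
# df.select(from_json(col("json"), "a INT, b STRING").alias("parsed"))
{code}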






[jira] [Assigned] (SPARK-21266) Support schema a DDL-formatted string in dapply/gapply/from_json

2017-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21266:


Assignee: (was: Apache Spark)

> Support schema a DDL-formatted string in dapply/gapply/from_json
> 
>
> Key: SPARK-21266
> URL: https://issues.apache.org/jira/browse/SPARK-21266
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> A DDL-formatted string is now supported in the schema API in the dataframe 
> reader/writer across the other language APIs. 
> {{from_json}} in R/Python does not appear to support this.
> Also, it could be supported in other commonly used APIs, in R specifically 
> {{dapply}}/{{gapply}}.






[jira] [Assigned] (SPARK-21266) Support schema a DDL-formatted string in dapply/gapply/from_json

2017-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21266:


Assignee: Apache Spark

> Support schema a DDL-formatted string in dapply/gapply/from_json
> 
>
> Key: SPARK-21266
> URL: https://issues.apache.org/jira/browse/SPARK-21266
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> A DDL-formatted string is now supported in the schema API in the dataframe 
> reader/writer across the other language APIs. 
> {{from_json}} in R/Python does not appear to support this.
> Also, it could be supported in other commonly used APIs, in R specifically 
> {{dapply}}/{{gapply}}.






[jira] [Assigned] (SPARK-21078) JobHistory applications synchronized is invalid

2017-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21078:


Assignee: (was: Apache Spark)

> JobHistory applications synchronized is invalid
> ---
>
> Key: SPARK-21078
> URL: https://issues.apache.org/jira/browse/SPARK-21078
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: fangfengbin
>
> FsHistoryProvider runs several threads that call mergeApplicationListing.
> The mergeApplicationListing function has code like this:
> applications.synchronized { 
>   ...
>   applications = mergedApps 
>   ...
> } 
> The applications object is reassigned inside the block, so synchronizing on it does not work.
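
The race described above comes down to synchronizing on a field that is reassigned inside the synchronized block, so later threads may lock a different object than earlier ones did. Below is a hedged, self-contained Scala sketch of the pattern and of the usual fix; the names ({{ListingSync}}, {{mergeBroken}}, {{mergeFixed}}) are illustrative only and are not taken from FsHistoryProvider:

{code}
// Illustrative sketch only; class and field names are hypothetical, not Spark's actual code.
object ListingSync {
  // Problematic pattern: the lock target itself is reassigned inside the synchronized block,
  // so two threads can end up synchronizing on different Map instances.
  @volatile private var applications: Map[String, String] = Map.empty

  def mergeBroken(newApps: Map[String, String]): Unit = applications.synchronized {
    applications = applications ++ newApps   // 'applications' now points at a new object
  }

  // Usual fix: synchronize on a dedicated lock object that is never reassigned.
  private val lock = new Object

  def mergeFixed(newApps: Map[String, String]): Unit = lock.synchronized {
    applications = applications ++ newApps
  }

  def snapshot: Map[String, String] = applications
}
{code}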



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21078) JobHistory applications synchronized is invalid

2017-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071115#comment-16071115
 ] 

Apache Spark commented on SPARK-21078:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/18497

> JobHistory applications synchronized is invalid
> ---
>
> Key: SPARK-21078
> URL: https://issues.apache.org/jira/browse/SPARK-21078
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: fangfengbin
>
> FsHistoryProvider runs several threads that call mergeApplicationListing.
> The mergeApplicationListing function has code like this:
> applications.synchronized { 
>   ...
>   applications = mergedApps 
>   ...
> } 
> The applications object is reassigned inside the block, so synchronizing on it does not work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21078) JobHistory applications synchronized is invalid

2017-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21078:


Assignee: Apache Spark

> JobHistory applications synchronized is invalid
> ---
>
> Key: SPARK-21078
> URL: https://issues.apache.org/jira/browse/SPARK-21078
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: fangfengbin
>Assignee: Apache Spark
>
> FsHistoryProvider runs several threads that call mergeApplicationListing.
> The mergeApplicationListing function has code like this:
> applications.synchronized { 
>   ...
>   applications = mergedApps 
>   ...
> } 
> The applications object is reassigned inside the block, so synchronizing on it does not work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21206) the window slice of Dstream is wrong

2017-07-01 Thread Fei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071110#comment-16071110
 ] 

Fei Shao commented on SPARK-21206:
--

[~srowen]

Here is the case I want to express:

!screenshot-2.png!

I sent an email named "the function of countByValueAndWindow and foreachRDD in 
DStream, would you like help me understand it please?".
It is related to this issue.


> the window slice of Dstream is wrong
> 
>
> Key: SPARK-21206
> URL: https://issues.apache.org/jira/browse/SPARK-21206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Fei Shao
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> The code is:
> val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint( "path")
> val lines = ssc.socketTextStream("IP", PORT)
> lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => {
>   println( "RDD ID IS : " + s.id)
>   s.foreach( e => println("data is " + e._1 + " :" + e._2))
>   println()
> })
> The result is wrong. 
> I checked the log, it showed:
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Time 1498383086000 ms is valid
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Window time = 2000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Slide time = 8000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Zero time = 1498383078000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = 
> [1498383085000 ms, 1498383086000 ms]
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
> [1498383077000 ms, 1498383078000 ms]
> 17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
> 1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)
> 17/06/25 17:31:26 INFO ShuffledDStream: Time 1498383078000 ms is invalid as 
> zeroTime is 1498383078000 ms , slideDuration is 1000 ms and difference is 0 ms
> 17/06/25 17:31:26 DEBUG ShuffledDStream: Time 1498383079000 ms is valid
> 17/06/25 17:31:26 DEBUG MappedDStream: Time 1498383079000 ms is valid
> The slice time is wrong.
> [BTW]: Team members,
> if this is a bug, please don't fix it; I will try to fix it myself. Thanks :)
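
To make the parameter order concrete: {{countByValueAndWindow(windowDuration, slideDuration)}} takes the window length first and the slide interval second, so {{Seconds(2), Seconds(8)}} requests a 2-second window recomputed every 8 seconds. The following is a hedged plain-Scala sketch of the window arithmetic only, using the {{Seconds(1)}} batch interval from the code above and the batch time from the log; it mirrors the Interval math in {{ReducedWindowedDStream.compute}} but uses plain Longs rather than Spark's Time/Interval classes:

{code}
// Hedged sketch of the window arithmetic only, with values taken from the log above.
val windowDurationMs = 2000L           // Seconds(2)
val slideDurationMs  = 8000L           // Seconds(8)
val batchDurationMs  = 1000L           // Seconds(1) batch interval of the parent stream
val currentTimeMs    = 1498383086000L  // the batch time shown in the log

val currentWindowStart  = currentTimeMs - windowDurationMs + batchDurationMs   // 1498383085000
val previousWindowStart = currentWindowStart - slideDurationMs                 // 1498383077000
val previousWindowEnd   = currentTimeMs - slideDurationMs                      // 1498383078000

println(s"current window  = [$currentWindowStart ms, $currentTimeMs ms]")
println(s"previous window = [$previousWindowStart ms, $previousWindowEnd ms]")
{code}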



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21206) the window slice of Dstream is wrong

2017-07-01 Thread Fei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Shao updated SPARK-21206:
-
Attachment: screenshot-2.png

> the window slice of Dstream is wrong
> 
>
> Key: SPARK-21206
> URL: https://issues.apache.org/jira/browse/SPARK-21206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Fei Shao
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> The code is:
> val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint( "path")
> val lines = ssc.socketTextStream("IP", PORT)
> lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => {
>   println( "RDD ID IS : " + s.id)
>   s.foreach( e => println("data is " + e._1 + " :" + e._2))
>   println()
> })
> The result is wrong. 
> I checked the log, it showed:
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Time 1498383086000 ms is valid
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Window time = 2000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Slide time = 8000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Zero time = 1498383078000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = 
> [1498383085000 ms, 1498383086000 ms]
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
> [1498383077000 ms, 1498383078000 ms]
> 17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
> 1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)
> 17/06/25 17:31:26 INFO ShuffledDStream: Time 1498383078000 ms is invalid as 
> zeroTime is 1498383078000 ms , slideDuration is 1000 ms and difference is 0 ms
> 17/06/25 17:31:26 DEBUG ShuffledDStream: Time 1498383079000 ms is valid
> 17/06/25 17:31:26 DEBUG MappedDStream: Time 1498383079000 ms is valid
> The slice time is wrong.
> [BTW]: Team members,
> if this is a bug, please don't fix it; I will try to fix it myself. Thanks :)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21206) the window slice of Dstream is wrong

2017-07-01 Thread Fei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Shao updated SPARK-21206:
-
Attachment: screenshot-1.png

> the window slice of Dstream is wrong
> 
>
> Key: SPARK-21206
> URL: https://issues.apache.org/jira/browse/SPARK-21206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Fei Shao
> Attachments: screenshot-1.png
>
>
> The code is:
> val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint( "path")
> val lines = ssc.socketTextStream("IP", PORT)
> lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => {
>   println( "RDD ID IS : " + s.id)
>   s.foreach( e => println("data is " + e._1 + " :" + e._2))
>   println()
> })
> The result is wrong. 
> I checked the log, it showed:
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Time 1498383086000 ms is valid
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Window time = 2000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Slide time = 8000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Zero time = 1498383078000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = 
> [1498383085000 ms, 1498383086000 ms]
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
> [1498383077000 ms, 1498383078000 ms]
> 17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
> 1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)
> 17/06/25 17:31:26 INFO ShuffledDStream: Time 1498383078000 ms is invalid as 
> zeroTime is 1498383078000 ms , slideDuration is 1000 ms and difference is 0 ms
> 17/06/25 17:31:26 DEBUG ShuffledDStream: Time 1498383079000 ms is valid
> 17/06/25 17:31:26 DEBUG MappedDStream: Time 1498383079000 ms is valid
> The slice time is wrong.
> [BTW]: Team members,
> if this is a bug, please don't fix it; I will try to fix it myself. Thanks :)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21206) the window slice of Dstream is wrong

2017-07-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071105#comment-16071105
 ] 

Sean Owen commented on SPARK-21206:
---

I'm still not clear what you're saying, even after formatting the code and 
trying to read comments. It appears to be logging the window size, but that's 
correct.

> the window slice of Dstream is wrong
> 
>
> Key: SPARK-21206
> URL: https://issues.apache.org/jira/browse/SPARK-21206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Fei Shao
>
> The code is:
> val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint( "path")
> val lines = ssc.socketTextStream("IP", PORT)
> lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => {
>   println( "RDD ID IS : " + s.id)
>   s.foreach( e => println("data is " + e._1 + " :" + e._2))
>   println()
> })
> The result is wrong. 
> I checked the log, it showed:
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Time 1498383086000 ms is valid
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Window time = 2000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Slide time = 8000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Zero time = 1498383078000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = 
> [1498383085000 ms, 1498383086000 ms]
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
> [1498383077000 ms, 1498383078000 ms]
> 17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
> 1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)
> 17/06/25 17:31:26 INFO ShuffledDStream: Time 1498383078000 ms is invalid as 
> zeroTime is 1498383078000 ms , slideDuration is 1000 ms and difference is 0 ms
> 17/06/25 17:31:26 DEBUG ShuffledDStream: Time 1498383079000 ms is valid
> 17/06/25 17:31:26 DEBUG MappedDStream: Time 1498383079000 ms is valid
> The slice time is wrong.
> [BTW]: Team members,
> if this is a bug, please don't fix it; I will try to fix it myself. Thanks :)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21206) the window slice of Dstream is wrong

2017-07-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062284#comment-16062284
 ] 

Sean Owen edited comment on SPARK-21206 at 7/1/17 8:40 AM:
---

[~srowen]

I am sorry, I did not give enough information about this issue.

For my test code:

{code}
lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => { 
《=== here the windowDuration is 2 seconds and the slideDuration is 8 seconds. 


===log begin 
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = [1498383085000 
ms, 1498383086000 ms]
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
[1498383077000 ms, 1498383078000 ms]
17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)  《=== here, 
the old RDD slices from 1498383077000 to 1498383084000. It is 8 seconds; 
actually, it should be 2 seconds.
===log end

===the code in ReducedWindowedDStream.scala begin
  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
val reduceF = reduceFunc
val invReduceF = invReduceFunc

val currentTime = validTime
val currentWindow = new Interval(currentTime - windowDuration + 
parent.slideDuration,
  currentTime)
val previousWindow = currentWindow - slideDuration

logDebug("Window time = " + windowDuration)
logDebug("Slide time = " + slideDuration)
logDebug("Zero time = " + zeroTime)
logDebug("Current window = " + currentWindow)
logDebug("Previous window = " + previousWindow)

//  _
// |  previous window   _|___
// |___|   current window|  --> Time
// |_|
//
// | _|  | _|
//  | |
//  V V
//   old RDDs new RDDs
//

// Get the RDDs of the reduced values in "old time steps"
val oldRDDs =
  reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - 
parent.slideDuration) 

// Get the RDDs of the reduced values in "new time steps"
val newRDDs =
  reducedStream.slice(previousWindow.endTime + parent.slideDuration, 
currentWindow.endTime)
logDebug("# new RDDs = " + newRDDs.size)

===code in ReducedWindowedDStream.scala  end

Imagine the case below (the slideDuration is greater than the windowDuration):
//  __
// |  previous window|   _
// |__|--|current window|  
--> Time
//   
||
//
// ___  ___|  | |
//   |   |
//   V V
// old RDDs   new RDDs
{code}

I think we can change the expressions for oldRDDs and newRDDs to fix this issue.
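
Plugging the timestamps from the log into the two {{slice}} calls quoted above makes the mismatch easier to see. This is a hedged arithmetic sketch with plain Longs, not Spark's Time/Interval types:

{code}
// Hedged worked example of the oldRDDs/newRDDs slice boundaries, using the values from the log.
val batchMs        = 1000L                                 // parent.slideDuration, i.e. the batch interval
val previousWindow = (1498383077000L, 1498383078000L)      // previous window from the log
val currentWindow  = (1498383085000L, 1498383086000L)      // current window from the log

// oldRDDs = reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)
val oldRDDsRange = (previousWindow._1, currentWindow._1 - batchMs)   // (1498383077000, 1498383084000),
                                                                     // matching the "Slicing from ... to ..." log line

// newRDDs = reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)
val newRDDsRange = (previousWindow._2 + batchMs, currentWindow._2)   // (1498383079000, 1498383086000)

println(s"old RDDs sliced over $oldRDDsRange")
println(s"new RDDs sliced over $newRDDsRange")
{code}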



was (Author: robin shao):
[~srowen]

I am sorry, I did not give enough message about this issue.

For my test code:

lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => { 
《=== here the windowDuration is 2 seconds and the slideDuration is 8 seconds. 


===log begin 
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = [1498383085000 
ms, 1498383086000 ms]
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
[1498383077000 ms, 1498383078000 ms]
17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)  《=== here, 
the old RDD  slices from 1498383077000  to 1498383084000 . It is 8 seconds. 
Actual it should be 2 seconds.
===log end

===the code in ReducedWindowedDStream.scala begin
  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
val reduceF = reduceFunc
val invReduceF = invReduceFunc

val currentTime = validTime
val currentWindow = new Interval(currentTime - windowDuration + 
parent.slideDuration,
  currentTime)
val previousWindow = currentWindow - slideDuration

logDebug("Window time = " + windowDuration)
logDebug("Slide time = " + slideDuration)
logDebug("Zero time = " + zeroTime)
logDebug("Current window = " + currentWindow)
logDebug("Previous window = " + previousWindow)

//  _
// |  previous window   _|___
// |___|   current window|  

[jira] [Comment Edited] (SPARK-21206) the window slice of Dstream is wrong

2017-07-01 Thread Fei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062284#comment-16062284
 ] 

Fei Shao edited comment on SPARK-21206 at 7/1/17 8:38 AM:
--

[~srowen]

I am sorry, I did not give enough information about this issue.

For my test code:

lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => { 
《=== here the windowDuration is 2 seconds and the slideDuration is 8 seconds. 


===log begin 
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = [1498383085000 
ms, 1498383086000 ms]
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
[1498383077000 ms, 1498383078000 ms]
17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)  《=== here, 
the old RDD slices from 1498383077000 to 1498383084000. It is 8 seconds; 
actually, it should be 2 seconds.
===log end

===the code in ReducedWindowedDStream.scala begin
  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
val reduceF = reduceFunc
val invReduceF = invReduceFunc

val currentTime = validTime
val currentWindow = new Interval(currentTime - windowDuration + 
parent.slideDuration,
  currentTime)
val previousWindow = currentWindow - slideDuration

logDebug("Window time = " + windowDuration)
logDebug("Slide time = " + slideDuration)
logDebug("Zero time = " + zeroTime)
logDebug("Current window = " + currentWindow)
logDebug("Previous window = " + previousWindow)

//  _
// |  previous window   _|___
// |___|   current window|  --> Time
// |_|
//
// | _|  | _|
//  | |
//  V V
//   old RDDs new RDDs
//

// Get the RDDs of the reduced values in "old time steps"
val oldRDDs =
  reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - 
parent.slideDuration) 

// Get the RDDs of the reduced values in "new time steps"
val newRDDs =
  reducedStream.slice(previousWindow.endTime + parent.slideDuration, 
currentWindow.endTime)
logDebug("# new RDDs = " + newRDDs.size)

===code in ReducedWindowedDStream.scala  end

Imagine the case below (the slideDuration is greater than the windowDuration):
//  __
// |  previous window|   _
// |__|--|current window|  
--> Time
//   
||
//
// ___  ___|  | |
//   |   |
//   V V
// old RDDs   new RDDs

I think we can change the expressions for oldRDDs and newRDDs to fix this issue.



was (Author: robin shao):
[~srowen]

I am sorry, I did not give enough message about this issue.

For my test code:

lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => { 
《=== here the windowDuration is 2 seconds and the slideDuration is 8 seconds. 


===log begin 
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = [1498383085000 
ms, 1498383086000 ms]
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
[1498383077000 ms, 1498383078000 ms]
17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)  《=== here, 
the old RDD  slices from 1498383077000  to 1498383084000 . It is 8 seconds. 
Actual it should be 2 seconds.
===log end

===the code in ReducedWindowedDStream.scala begin
  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
val reduceF = reduceFunc
val invReduceF = invReduceFunc

val currentTime = validTime
val currentWindow = new Interval(currentTime - windowDuration + 
parent.slideDuration,
  currentTime)
val previousWindow = currentWindow - slideDuration

logDebug("Window time = " + windowDuration)
logDebug("Slide time = " + slideDuration)
logDebug("Zero time = " + zeroTime)
logDebug("Current window = " + currentWindow)
logDebug("Previous window = " + previousWindow)

//  _
// |  previous window   _|___
// |___|   current window|  

[jira] [Comment Edited] (SPARK-21206) the window slice of Dstream is wrong

2017-07-01 Thread Fei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062284#comment-16062284
 ] 

Fei Shao edited comment on SPARK-21206 at 7/1/17 8:30 AM:
--

[~srowen]

I am sorry, I did not give enough information about this issue.

For my test code:

lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => { 
《=== here the windowDuration is 2 seconds and the slideDuration is 8 seconds. 


===log begin 
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = [1498383085000 
ms, 1498383086000 ms]
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
[1498383077000 ms, 1498383078000 ms]
17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)  《=== here, 
the old RDD slices from 1498383077000 to 1498383084000. It is 8 seconds; 
actually, it should be 2 seconds.
===log end

===the code in ReducedWindowedDStream.scala begin
  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
val reduceF = reduceFunc
val invReduceF = invReduceFunc

val currentTime = validTime
val currentWindow = new Interval(currentTime - windowDuration + 
parent.slideDuration,
  currentTime)
val previousWindow = currentWindow - slideDuration

logDebug("Window time = " + windowDuration)
logDebug("Slide time = " + slideDuration)
logDebug("Zero time = " + zeroTime)
logDebug("Current window = " + currentWindow)
logDebug("Previous window = " + previousWindow)

//  _
// |  previous window   _|___
// |___|   current window|  --> Time
// |_|
//
// | _|  | _|
//  | |
//  V V
//   old RDDs new RDDs
//

// Get the RDDs of the reduced values in "old time steps"
val oldRDDs =
  reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - 
parent.slideDuration) 

// Get the RDDs of the reduced values in "new time steps"
val newRDDs =
  reducedStream.slice(previousWindow.endTime + parent.slideDuration, 
currentWindow.endTime)
logDebug("# new RDDs = " + newRDDs.size)

===code in ReducedWindowedDStream.scala  end

Imagine the case below (the slideDuration is greater than the windowDuration):
//  _
// |  previous window|   
// |_|--|current window  |  
--> Time
//  ||
//
// |  ___|  | ___|
//   ||
//   VV
// old RDDs new RDDs

I think we can change the expressions for oldRDDs and newRDDs to fix this issue.



was (Author: robin shao):
Hi  Sean Owen,

I am sorry, I did not give enough message about this issue.

For my test code:

lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => { 
《=== here the windowDuration is 2 seconds and the slideDuration is 8 seconds. 


===log begin 
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = [1498383085000 
ms, 1498383086000 ms]
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
[1498383077000 ms, 1498383078000 ms]
17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)  《=== here, 
the old RDD  slices from 1498383077000  to 1498383084000 . It is 8 seconds. 
Actual it should be 2 seconds.
===log end

===code in ReducedWindowedDStream.scala begin
  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
val reduceF = reduceFunc
val invReduceF = invReduceFunc

val currentTime = validTime
val currentWindow = new Interval(currentTime - windowDuration + 
parent.slideDuration,
  currentTime)
val previousWindow = currentWindow - slideDuration

logDebug("Window time = " + windowDuration)
logDebug("Slide time = " + slideDuration)
logDebug("Zero time = " + zeroTime)
logDebug("Current window = " + currentWindow)
logDebug("Previous window = " + previousWindow)

//  _
// |  previous window   _|___
// |___|   current window|  --> Time
// 

[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"

2017-07-01 Thread Santhavathi S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071073#comment-16071073
 ] 

Santhavathi S commented on SPARK-4131:
--

The requirement is to have this available in SQL.

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Assignee: Fei Wang
>Priority: Critical
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Spark SQL does not support writing data into the filesystem from queries,
> e.g.:
> {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
> from page_views;
> {code}
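
For context, the workaround available at the time was to route the query result through the DataFrame writer API rather than SQL. A hedged Scala sketch follows; the output format, the path reused from the example above, and the assumption that a {{page_views}} table is already registered in the session are illustrative choices, not part of the original report:

{code}
// Hedged sketch of the DataFrame-level workaround; assumes a page_views table is registered,
// and the format and output path are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("write-query-result")
  .getOrCreate()

// Equivalent of INSERT OVERWRITE LOCAL DIRECTORY ... SELECT * FROM page_views,
// expressed through the writer API instead of SQL.
spark.sql("SELECT * FROM page_views")
  .write
  .mode("overwrite")
  .format("csv")                           // or "parquet", "json", ...
  .save("file:///data1/wangxj/sql_spark")
{code}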



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20602) Adding LBFGS optimizer and Squared_hinge loss for LinearSVC

2017-07-01 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071065#comment-16071065
 ] 

Yanbo Liang commented on SPARK-20602:
-

I'd like to shepherd this task and target this for 2.3. Thanks.

> Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
> ---
>
> Key: SPARK-20602
> URL: https://issues.apache.org/jira/browse/SPARK-20602
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>
> Currently LinearSVC in Spark only supports OWLQN as the optimizer (see 
> https://issues.apache.org/jira/browse/SPARK-14709). I compared LBFGS and 
> OWLQN on several public datasets and found that LBFGS converges much faster 
> for LinearSVC in most cases.
> The following table presents the number of training iterations and the f1 
> score of both optimizers until convergence:
> ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge||
> |news20.binary| 31 (0.99) | 413(0.99) |  185 (0.99) |
> |mushroom| 28(1.0) | 170(1.0)| 24(1.0) |
> |madelon|143(0.75) | 8129(0.70)| 823(0.74) |
> |breast-cancer-scale| 15(1.0) | 16(1.0)| 15 (1.0) |
> |phishing | 329(0.94) | 231(0.94) | 67 (0.94) |
> |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) |
> |a7a | 237 (0.84) | 372(0.84) | 69(0.84) |
> data source: 
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
> training code: new LinearSVC().setMaxIter(1).setTol(1e-6)
> LBFGS requires fewer iterations in most cases (except for a1a) and is 
> probably a better default optimizer. 
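
For anyone who wants to reproduce the comparison, here is a hedged Scala sketch of the training setup. The dataset path is a placeholder, and because the {{setMaxIter(1)}} value in the description looks truncated by the wiki formatting, the iteration cap below is an assumption rather than the value actually used for the table:

{code}
// Hedged sketch of the benchmark setup; the dataset path and maxIter cap are assumptions.
import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("linearsvc-benchmark")
  .getOrCreate()

// LIBSVM-format dataset, e.g. one of the binary datasets from the LIBSVM collection.
val data = spark.read.format("libsvm").load("data/news20.binary")

val svc = new LinearSVC()
  .setMaxIter(10000)   // assumed large cap so convergence is governed by the tolerance
  .setTol(1e-6)

val model = svc.fit(data)
println(s"Trained model with ${model.coefficients.size} coefficients")
{code}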



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20602) Adding LBFGS optimizer and Squared_hinge loss for LinearSVC

2017-07-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-20602:

Shepherd: Yanbo Liang
Target Version/s: 2.3.0

> Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
> ---
>
> Key: SPARK-20602
> URL: https://issues.apache.org/jira/browse/SPARK-20602
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>
> Currently LinearSVC in Spark only supports OWLQN as the optimizer (see 
> https://issues.apache.org/jira/browse/SPARK-14709). I compared LBFGS and 
> OWLQN on several public datasets and found that LBFGS converges much faster 
> for LinearSVC in most cases.
> The following table presents the number of training iterations and the f1 
> score of both optimizers until convergence:
> ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge||
> |news20.binary| 31 (0.99) | 413(0.99) |  185 (0.99) |
> |mushroom| 28(1.0) | 170(1.0)| 24(1.0) |
> |madelon|143(0.75) | 8129(0.70)| 823(0.74) |
> |breast-cancer-scale| 15(1.0) | 16(1.0)| 15 (1.0) |
> |phishing | 329(0.94) | 231(0.94) | 67 (0.94) |
> |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) |
> |a7a | 237 (0.84) | 372(0.84) | 69(0.84) |
> data source: 
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
> training code: new LinearSVC().setMaxIter(1).setTol(1e-6)
> LBFGS requires fewer iterations in most cases (except for a1a) and is 
> probably a better default optimizer. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20602) Adding LBFGS optimizer and Squared_hinge loss for LinearSVC

2017-07-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-20602:
---

Assignee: yuhao yang

> Adding LBFGS optimizer and Squared_hinge loss for LinearSVC
> ---
>
> Key: SPARK-20602
> URL: https://issues.apache.org/jira/browse/SPARK-20602
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>
> Currently LinearSVC in Spark only supports OWLQN as the optimizer (see 
> https://issues.apache.org/jira/browse/SPARK-14709). I compared LBFGS and 
> OWLQN on several public datasets and found that LBFGS converges much faster 
> for LinearSVC in most cases.
> The following table presents the number of training iterations and the f1 
> score of both optimizers until convergence:
> ||Dataset||LBFGS with hinge||OWLQN with hinge||LBFGS with squared_hinge||
> |news20.binary| 31 (0.99) | 413(0.99) |  185 (0.99) |
> |mushroom| 28(1.0) | 170(1.0)| 24(1.0) |
> |madelon|143(0.75) | 8129(0.70)| 823(0.74) |
> |breast-cancer-scale| 15(1.0) | 16(1.0)| 15 (1.0) |
> |phishing | 329(0.94) | 231(0.94) | 67 (0.94) |
> |a1a(adult) | 466 (0.87) | 282 (0.87) | 77 (0.86) |
> |a7a | 237 (0.84) | 372(0.84) | 69(0.84) |
> data source: 
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
> training code: new LinearSVC().setMaxIter(1).setTol(1e-6)
> LBFGS requires fewer iterations in most cases (except for a1a) and is 
> probably a better default optimizer. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21275) Update GLM test to use supportedFamilyNames

2017-07-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-21275.
-
   Resolution: Fixed
 Assignee: Wayne Zhang
Fix Version/s: 2.3.0

> Update GLM test to use supportedFamilyNames
> ---
>
> Key: SPARK-21275
> URL: https://issues.apache.org/jira/browse/SPARK-21275
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Address this comment: 
> https://github.com/apache/spark/pull/16699#discussion-diff-100574976R855



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org