[jira] [Updated] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites
[ https://issues.apache.org/jira/browse/SPARK-12309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12309: Description: Use sqlContext from MLlibTestSparkContext rather than creating new one for spark.ml test cases. (was: Use sqlContext from MLlibTestSparkContext rather than creating new one for each spark.ml test cases.) > Use sqlContext from MLlibTestSparkContext for spark.ml test suites > -- > > Key: SPARK-12309 > URL: https://issues.apache.org/jira/browse/SPARK-12309 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > Use sqlContext from MLlibTestSparkContext rather than creating new one for > spark.ml test cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites
[ https://issues.apache.org/jira/browse/SPARK-12309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12309: Assignee: (was: Apache Spark) > Use sqlContext from MLlibTestSparkContext for spark.ml test suites > -- > > Key: SPARK-12309 > URL: https://issues.apache.org/jira/browse/SPARK-12309 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > Use sqlContext from MLlibTestSparkContext rather than creating new one for > each spark.ml test cases.
[jira] [Assigned] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites
[ https://issues.apache.org/jira/browse/SPARK-12309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12309: Assignee: Apache Spark > Use sqlContext from MLlibTestSparkContext for spark.ml test suites > -- > > Key: SPARK-12309 > URL: https://issues.apache.org/jira/browse/SPARK-12309 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Apache Spark > > Use sqlContext from MLlibTestSparkContext rather than creating new one for > each spark.ml test cases.
[jira] [Updated] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites
[ https://issues.apache.org/jira/browse/SPARK-12309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12309: Description: Use sqlContext from MLlibTestSparkContext rather than creating new one for each spark.ml test cases. (was: Use sqlContext from MLlibTestSparkContext for spark.ml test suites) > Use sqlContext from MLlibTestSparkContext for spark.ml test suites > -- > > Key: SPARK-12309 > URL: https://issues.apache.org/jira/browse/SPARK-12309 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > Use sqlContext from MLlibTestSparkContext rather than creating new one for > each spark.ml test cases.
[jira] [Created] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites
Yanbo Liang created SPARK-12309: --- Summary: Use sqlContext from MLlibTestSparkContext for spark.ml test suites Key: SPARK-12309 URL: https://issues.apache.org/jira/browse/SPARK-12309 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang Use sqlContext from MLlibTestSparkContext for spark.ml test suites
[jira] [Commented] (SPARK-12062) Master rebuilding historical SparkUI should be asynchronous
[ https://issues.apache.org/jira/browse/SPARK-12062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054829#comment-15054829 ] Andrew Or commented on SPARK-12062: --- I see, if you already have a patch then this is worth fixing. However, I don't think we should introduce yet another configuration. It's best if the serving is asynchronous before we remove it completely. > Master rebuilding historical SparkUI should be asynchronous > --- > > Key: SPARK-12062 > URL: https://issues.apache.org/jira/browse/SPARK-12062 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Bryan Cutler > > When a long-running application finishes, it takes a while (sometimes > minutes) to rebuild the SparkUI. However, in Master.scala this is currently > done within the RPC event loop, which runs only in 1 thread. Thus, in the > meantime no other applications can register with this master.
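The asynchronous approach under discussion can be sketched as follows (illustrative names, not Spark's Master internals): the RPC handler only enqueues the finished application, and a dedicated worker thread performs the slow SparkUI rebuild, leaving the single-threaded RPC loop free to register new applications.

```python
import queue
import threading

class Master:
    def __init__(self):
        self.rebuilt_uis = []
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._rebuild_loop, daemon=True)
        self._worker.start()

    def _rebuild_loop(self):
        while True:
            app_id = self._pending.get()
            if app_id is None:  # shutdown sentinel
                return
            # Stands in for the minutes-long history rebuild.
            self.rebuilt_uis.append(app_id)

    def on_application_finished(self, app_id):
        # Called from the RPC event loop; returns immediately instead of
        # blocking the loop for the duration of the rebuild.
        self._pending.put(app_id)

    def stop(self):
        self._pending.put(None)
        self._worker.join()
```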
[jira] [Created] (SPARK-12308) Better onDisconnected logic for Master
Shixiong Zhu created SPARK-12308: Summary: Better onDisconnected logic for Master Key: SPARK-12308 URL: https://issues.apache.org/jira/browse/SPARK-12308 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Pasting [~vanzin]'s comment from GitHub (https://github.com/apache/spark/pull/10261#issuecomment-164107825): bq. That's because we took opposite approaches. Your PR's approach is "if this connection says the sender address is a listening socket, then also consider that address when sending events about the remote process". bq. My PR takes the opposite approach: the address of the remote process is always the address of the socket used to connect, regardless of whether its also listening in another socket. bq. I think my approach is in the end more correct, but requires more code to fix existing code. In my view, RpcCallContext.senderAddress should be the address of the socket that sent the message, not the address the remote process is listening on.
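The two addresses contrasted in the comment can be made concrete with a small sketch (types here are illustrative stand-ins, not Spark's RPC classes): a message arrives on a connecting socket, and the peer may separately advertise a listening address.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RpcAddress:
    host: str
    port: int

@dataclass
class ClientInfo:
    socket_address: RpcAddress               # where the connection came from
    listening_address: Optional[RpcAddress]  # advertised listen address, if any

def sender_address(client: ClientInfo) -> RpcAddress:
    # Under the approach described in the comment, senderAddress is always the
    # socket's address, even when the peer also listens on another socket.
    return client.socket_address
```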
[jira] [Resolved] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats
[ https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-12267. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.0.0 1.6.1 Set 1.6.1 as one of fix versions. If there is RC3, should change it to 1.6.0. > Standalone master keeps references to disassociated workers until they sent > no heartbeats > - > > Key: SPARK-12267 > URL: https://issues.apache.org/jira/browse/SPARK-12267 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Jacek Laskowski >Assignee: Shixiong Zhu > Fix For: 1.6.1, 2.0.0 > > > While toying with Spark Standalone I've noticed the following messages > in the logs of the master: > {code} > INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM > INFO Master: localhost:59920 got disassociated, removing it. > ... > WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because > we got no heartbeat in 60 seconds > INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919 > on 192.168.1.6:59919 > {code} > Why does the message "WARN Master: Removing > worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in > 60 seconds" appear when the worker should've been removed already (as > pointed out in "INFO Master: localhost:59920 got disassociated, > removing it.")? > Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920? > I started master using {{./sbin/start-master.sh -h localhost}} and the > workers {{./sbin/start-slave.sh spark://localhost:7077}}.
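The address mismatch the reporter suspects can be sketched directly: if the master tracks the worker under its advertised RPC address (192.168.1.6:59919) but the disassociation event carries the connecting socket's address (localhost:59920), the removal lookup misses and the worker lingers until the 60-second heartbeat timeout. This is a simplified illustration of the log above, not Master.scala's actual data structures.

```python
# Worker registry keyed by the advertised RPC address.
workers = {"192.168.1.6:59919": "worker-20151210090708-192.168.1.6-59919"}

def on_disassociated(remote_address):
    # Returns the removed worker id, or None when no worker matches the
    # address the disassociation event carried.
    return workers.pop(remote_address, None)
```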
[jira] [Resolved] (SPARK-12199) Follow-up: Refine example code in ml-features.md
[ https://issues.apache.org/jira/browse/SPARK-12199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-12199. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10193 [https://github.com/apache/spark/pull/10193] > Follow-up: Refine example code in ml-features.md > > > Key: SPARK-12199 > URL: https://issues.apache.org/jira/browse/SPARK-12199 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xusen Yin > Labels: starter > Fix For: 2.0.0, 1.6.1 > >
[jira] [Created] (SPARK-12307) ParquetFormat options should be exposed through the DataFrameReader/Writer options API
holdenk created SPARK-12307: --- Summary: ParquetFormat options should be exposed through the DataFrameReader/Writer options API Key: SPARK-12307 URL: https://issues.apache.org/jira/browse/SPARK-12307 Project: Spark Issue Type: Improvement Components: SQL Reporter: holdenk Priority: Trivial Currently many options for loading/saving Parquet need to be set globally on the SparkContext. It would be useful to also provide support for setting these options through the DataFrameReader/DataFrameWriter.
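The per-call options pattern the issue asks for, as opposed to one global setting on the context, can be sketched like this (class and option names are illustrative; this is not the real DataFrameWriter):

```python
class Writer:
    def __init__(self):
        self._options = {}

    def option(self, key, value):
        # Options apply only to this read/write, not globally.
        self._options[key] = value
        return self  # chainable, builder style

    def parquet(self, path):
        # A real implementation would hand these options to the Parquet
        # writer; here we just report what would be applied.
        return path, dict(self._options)
```

Usage would then look like `Writer().option("compression", "snappy").parquet("/tmp/out")`, scoping the setting to a single write.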
[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label
[ https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054664#comment-15054664 ] Benjamin Fradet commented on SPARK-7425: Is anyone working on this? I'm considering taking over this JIRA. I started writing some unit tests for a few predictors, and I'm wondering whether I should write unit tests for all the predictors. Input welcome. > spark.ml Predictor should support other numeric types for label > --- > > Key: SPARK-7425 > URL: https://issues.apache.org/jira/browse/SPARK-7425 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > Labels: starter > > Currently, the Predictor abstraction expects the input labelCol type to be > DoubleType, but we should support other numeric types. This will involve > updating the PredictorParams.validateAndTransformSchema method.
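The relaxed check the issue describes, accepting any numeric label type rather than DoubleType only, can be sketched as follows. The type names mirror Spark SQL's, but the schema is a plain dict here, not the real validateAndTransformSchema API:

```python
# Numeric types a label column would be allowed to have (then cast to double).
NUMERIC_TYPES = {"ByteType", "ShortType", "IntegerType", "LongType",
                 "FloatType", "DoubleType", "DecimalType"}

def validate_label_column(schema, label_col):
    dtype = schema[label_col]
    if dtype not in NUMERIC_TYPES:
        raise TypeError(f"Label column {label_col!r} must be numeric, got {dtype}")
    return dtype
```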
[jira] [Updated] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9686: Target Version/s: (was: 1.6.0) > Spark Thrift server doesn't return correct JDBC metadata > - > > Key: SPARK-9686 > URL: https://issues.apache.org/jira/browse/SPARK-9686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2 >Reporter: pin_zhang >Assignee: Cheng Lian >Priority: Critical > Attachments: SPARK-9686.1.patch.txt > > > 1. Start start-thriftserver.sh > 2. Connect with beeline > 3. Create a table > 4. Show tables; the newly created table is returned > 5. > Class.forName("org.apache.hive.jdbc.HiveDriver"); > String URL = "jdbc:hive2://localhost:1/default"; > Properties info = new Properties(); > Connection conn = DriverManager.getConnection(URL, info); > ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(), > null, null, null); > Problem: > No tables are returned by this API; this worked in Spark 1.3
[jira] [Updated] (SPARK-11785) When deployed against remote Hive metastore with lower versions, JDBC metadata calls throws exception
[ https://issues.apache.org/jira/browse/SPARK-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11785: - Target Version/s: (was: 1.6.0) > When deployed against remote Hive metastore with lower versions, JDBC > metadata calls throws exception > - > > Key: SPARK-11785 > URL: https://issues.apache.org/jira/browse/SPARK-11785 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > > To reproduce this issue with 1.7-SNAPSHOT > # Start Hive 0.13.1 metastore service using {{$HIVE_HOME/bin/hive --service > metastore}} > # Configures remote Hive metastore in {{conf/hive-site.xml}} by pointing > {{hive.metastore.uris}} to metastore endpoint (e.g. > {{thrift://localhost:9083}}) > # Set {{spark.sql.hive.metastore.version}} to {{0.13.1}} and > {{spark.sql.hive.metastore.jars}} to {{maven}} in {{conf/spark-defaults.conf}} > # Start Thrift server using {{$SPARK_HOME/sbin/start-thriftserver.sh}} > # Run the testing JDBC client program attached at the end > Exception thrown from client side: > {noformat} > java.sql.SQLException: Could not create ResultSet: Required field > 'operationHandle' is unset! > Struct:TGetResultSetMetadataReq(operationHandle:null) > java.sql.SQLException: Could not create ResultSet: Required field > 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null) > at > org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:273) > at > org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188) > at > org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170) > at > org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222) > at JDBCExperiments$.main(JDBCExperiments.scala:28) > at JDBCExperiments.main(JDBCExperiments.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > Caused by: org.apache.thrift.protocol.TProtocolException: Required field > 'operationHandle' is unset! > Struct:TGetResultSetMetadataReq(operationHandle:null) > at > org.apache.hive.service.cli.thrift.TGetResultSetMetadataReq.validate(TGetResultSetMetadataReq.java:290) > at > org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.validate(TCLIService.java:12041) > at > org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12098) > at > org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12067) > at > org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.write(TCLIService.java:12018) > at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:63) > at > org.apache.hive.service.cli.thrift.TCLIService$Client.send_GetResultSetMetadata(TCLIService.java:472) > at > org.apache.hive.service.cli.thrift.TCLIService$Client.GetResultSetMetadata(TCLIService.java:464) > at > org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:242) > at > 
org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188) > at > org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170) > at > org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222) > at JDBCExperiments$.main(JDBCExperiments.scala:28) > at JDBCExperiments.main(JDBCExperiments.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > {noformat} > Exception thrown from server side: > {noformat} > 15/11/18 02:27:01 WARN RetryingMetaStoreClient: MetaStoreClient lost > connection. Attempting to reconnect. > org.apache.thrift.TApplicationException: Invalid method name: > 'get_schema_with_environment_context' > at > org.apache.thrift.TApplicationException.read(TApplicationException.java:111) > at > org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71) > at
[jira] [Updated] (SPARK-12199) Follow-up: Refine example code in ml-features.md
[ https://issues.apache.org/jira/browse/SPARK-12199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12199: -- Shepherd: Joseph K. Bradley (was: Xiangrui Meng) > Follow-up: Refine example code in ml-features.md > > > Key: SPARK-12199 > URL: https://issues.apache.org/jira/browse/SPARK-12199 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Assignee: Xusen Yin > Labels: starter >
[jira] [Assigned] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12218: Assignee: Apache Spark > Boolean logic in sql does not work "not (A and B)" is not the same as "(not > A) or (not B)" > > > Key: SPARK-12218 > URL: https://issues.apache.org/jira/browse/SPARK-12218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Irakli Machabeli >Assignee: Apache Spark >Priority: Blocker > > Two identical queries produce different results > In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( > PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff'))").count() > Out[2]: 18 > In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( > not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff')))").count() > Out[3]: 28
[jira] [Assigned] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12218: Assignee: (was: Apache Spark) > Boolean logic in sql does not work "not (A and B)" is not the same as "(not > A) or (not B)" > > > Key: SPARK-12218 > URL: https://issues.apache.org/jira/browse/SPARK-12218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Irakli Machabeli >Priority: Blocker > > Two identical queries produce different results > In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( > PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff'))").count() > Out[2]: 18 > In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( > not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff')))").count() > Out[3]: 28
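By De Morgan's laws the two filters quoted in this issue must select exactly the same rows, which is why the differing counts (18 vs 28) indicate a bug. The intended equivalence, checked on made-up in-memory rows rather than the reporter's parquet data:

```python
rows = [
    {"LoanID": 62231, "PaymentsReceived": 0.0,  "ExplicitRoll": "PreviouslyPaidOff"},
    {"LoanID": 62231, "PaymentsReceived": 10.0, "ExplicitRoll": "PreviouslyPaidOff"},
    {"LoanID": 62231, "PaymentsReceived": 0.0,  "ExplicitRoll": "Current"},
    {"LoanID": 11111, "PaymentsReceived": 0.0,  "ExplicitRoll": "Current"},
]
ROLLS = ("PreviouslyPaidOff", "PreviouslyChargedOff")

def not_both(r):
    # LoanID=62231 and not (PaymentsReceived=0 and ExplicitRoll in (...))
    return r["LoanID"] == 62231 and not (
        r["PaymentsReceived"] == 0 and r["ExplicitRoll"] in ROLLS)

def de_morgan(r):
    # LoanID=62231 and (not PaymentsReceived=0 or not ExplicitRoll in (...))
    return r["LoanID"] == 62231 and (
        r["PaymentsReceived"] != 0 or r["ExplicitRoll"] not in ROLLS)
```

Both predicates must agree on every row; any row where they disagree is a counterexample demonstrating the bug.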
[jira] [Commented] (SPARK-12303) Configuration parameter by which can choose if we want the REPL generated class directory name to be random or fixed name.
[ https://issues.apache.org/jira/browse/SPARK-12303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054515#comment-15054515 ] Neelesh Srinivas Salian commented on SPARK-12303: - Hi piyush, Thanks for opening the JIRA. Going by your description, I see you intend to change: {code} private val SPARK_DEBUG_REPL: Boolean = (System.getenv("SPARK_DEBUG_REPL") == "1") /** Local directory to save .class files too */ private lazy val outputDir = { val tmp = System.getProperty("java.io.tmpdir") val rootDir = conf.get("spark.repl.classdir", tmp) Utils.createTempDir(rootDir) } if (SPARK_DEBUG_REPL) { echo("Output directory: " + outputDir) } {code} Could you please elaborate on what the benefit of having that option would be? It will create the directory based on your conf settings. > Configuration parameter by which can choose if we want the REPL generated > class directory name to be random or fixed name. > --- > > Key: SPARK-12303 > URL: https://issues.apache.org/jira/browse/SPARK-12303 > Project: Spark > Issue Type: New Feature > Components: Spark Shell >Reporter: piyush >Priority: Minor > > .class generated by spark REPL are stored in a temp directory with random > name. > Configuration parameter by which can choose if we want the REPL generated > class directory name to be random or fixed name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
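The requested behavior can be sketched as follows. The configuration key below is hypothetical, not an existing Spark setting: when a fixed name is configured, reuse a predictable directory; otherwise keep the current behavior of a randomly named temp directory under the root from spark.repl.classdir.

```python
import os
import tempfile

def choose_output_dir(conf, root_dir):
    fixed_name = conf.get("spark.repl.classdir.fixedName")  # hypothetical key
    if fixed_name:
        # Fixed, predictable directory: same path on every invocation.
        path = os.path.join(root_dir, fixed_name)
        os.makedirs(path, exist_ok=True)
        return path
    # Default: randomly named temp directory, as the REPL does today.
    return tempfile.mkdtemp(prefix="repl-", dir=root_dir)
```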
[jira] [Updated] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-4621: Description: in ExternalShuffle, we can use LRUCache to store recently finished shuffle index and that can reduce indexFile's io. At first, i implement it for ExternalShuffle. Latter i will add it to IndexShuffleBlockResolver if i can. (was: in IndexShuffleBlockManager, we can use LRUCache to store recently finished shuffle index and that can reduce indexFile's io.) > Shuffle index can be cached for SortShuffleManager in ExternalShuffle in > order to reduce indexFile's io > > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Assignee: Apache Spark >Priority: Minor > > in ExternalShuffle, we can use LRUCache to store recently finished shuffle > index and that can reduce indexFile's io. At first, i implement it for > ExternalShuffle. Latter i will add it to IndexShuffleBlockResolver if i can.
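The LRUCache the description proposes can be sketched as follows, keyed by a hypothetical (shuffleId, mapId) pair with the parsed index offsets as the value; this is an illustration of the caching idea, not the actual external shuffle service code.

```python
from collections import OrderedDict

class ShuffleIndexCache:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        return None  # caller reads the index file and calls put()

    def put(self, key, offsets):
        self._entries[key] = offsets
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

On a hit the index file is never reopened, which is exactly the I/O reduction the issue targets.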
[jira] [Assigned] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4621: --- Assignee: Apache Spark > Shuffle index can be cached for SortShuffleManager in ExternalShuffle in > order to reduce indexFile's io > > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Assignee: Apache Spark >Priority: Minor > > in IndexShuffleBlockManager, we can use LRUCache to store recently finished > shuffle index and that can reduce indexFile's io.
[jira] [Commented] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054346#comment-15054346 ] Apache Spark commented on SPARK-4621: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/10277 > Shuffle index can be cached for SortShuffleManager in ExternalShuffle in > order to reduce indexFile's io > > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Priority: Minor > > in IndexShuffleBlockManager, we can use LRUCache to store recently finished > shuffle index and that can reduce indexFile's io.
[jira] [Updated] (SPARK-4621) Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-4621: Summary: Shuffle index can be cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io (was: Shuffle index can cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io) > Shuffle index can be cached for SortShuffleManager in ExternalShuffle in > order to reduce indexFile's io > > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Priority: Minor > > in IndexShuffleBlockManager, we can use LRUCache to store recently finished > shuffle index and that can reduce indexFile's io.
[jira] [Comment Edited] (SPARK-12137) Spark Streaming State Recovery limitations
[ https://issues.apache.org/jira/browse/SPARK-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054339#comment-15054339 ] Ravindar edited comment on SPARK-12137 at 12/12/15 2:55 PM: Sean, thanks for the clarification that the current recovery functionality applies to system failure only. One has to manually delete the checkpoint directory when the processing steps change as part of an upgrade; this has to be treated separately as application functionality. For application state continuity in the upgrade scenario, the application has to explicitly save the last state for *updateStateByKey* in each iteration and then restore it if a last saved state exists, else create a default value. Or is this state already there in existing *checkpointing* that you can look up and retrieve? I am looking for a best practice in this scenario (any streaming examples?) with the following questions: 1. Do you serialize/deserialize to/from HDFS with the key as file name and the state as content? 2. Do you serialize/deserialize to/from Cassandra with key, content? was (Author: rroopreddy): Sean, thanks for the clarification on the current functionality. One has to manually delete the checkpoint directory when the processing steps change as a part of upgrade. For state continuity in upgrade scenario, the application has to explicitly save last state for *updateStateByKey* in each iteration and then restore if last saved exists else create a default value. Or this state is already there in existing *checkpointing* that you can lookup and retrieve I am looking a best practice in this scenario (any streaming examples?) with following questions 1. Do you serialize/deserialize to/from HDFS with key as file name and state as content 2. 
Do you serialize/deserialize to/from Cassandra with key, content > Spark Streaming State Recovery limitations > -- > > Key: SPARK-12137 > URL: https://issues.apache.org/jira/browse/SPARK-12137 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Ravindar >Priority: Critical > > There was multiple threads in forums asking similar question without a clear > answer and hence entering it here. > We have a streaming application that goes through multi-step processing. In > some of these steps stateful operations like *updateStateByKey* are used to > maintain an accumulated running state (and other state info) with incoming > RDD streams. As streaming application is incremental, it is imperative that > we recover/restore from previous known state in the following two scenarios > 1. On spark driver/streaming application failure. > In this scenario the driver/streaming application shutdown and > restarted. The recommended approach is enable the *checkpoint(checkpointDir)* > and use *StreamingContext.getOrCreate* to restore the context from checkpoint > state. > 2. Upgrade driver/streaming application with additional steps in the > processing > In this scenario, we introduced new steps with downstream processing for > new functionality without changes to existing steps. Upgrading the streaming > application with the new fails on *StreamingContext.getOrCreate* as there is > mismatch in checkpoint saved. > Both of the above scenarios needs a unified approach where accumulated state > has to be saved and restored. The first approach of restoring from checkpoint > works for driver failure but not code upgrade. When the application code > changed, there is a recommendation to delete checkpoint data when new code is > deployed. If so, how do you reconstitute all of the stateful (e.g: > updateStateByKey) information from the last run. 
Every streaming application > has to save up-to-date state for each session represented by key and then > initialize it from this when a new session starts for the same key. Does > every application have to create their own mechanism given this is very > similar to current state checkpointing to HDFS.
[jira] [Updated] (SPARK-4621) Shuffle index can cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io
[ https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-4621: Summary: Shuffle index can cached for SortShuffleManager in ExternalShuffle in order to reduce indexFile's io (was: when sort- based shuffle, Cache recently finished shuffle index can reduce indexFile's io) > Shuffle index can cached for SortShuffleManager in ExternalShuffle in order > to reduce indexFile's io > - > > Key: SPARK-4621 > URL: https://issues.apache.org/jira/browse/SPARK-4621 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Lianhui Wang >Priority: Minor > > in IndexShuffleBlockManager, we can use LRUCache to store recently finished > shuffle index and that can reduce indexFile's io.
[jira] [Commented] (SPARK-12137) Spark Streaming State Recovery limitations
[ https://issues.apache.org/jira/browse/SPARK-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054339#comment-15054339 ] Ravindar commented on SPARK-12137: -- Sean, thanks for the clarification on the current functionality. One has to manually delete the checkpoint directory when the processing steps change as part of an upgrade. For state continuity in the upgrade scenario, the application has to explicitly save the last state for *updateStateByKey* in each iteration, and then restore it if a last saved state exists, else create a default value. Or is this state already there in the existing *checkpointing*, so that you can look it up and retrieve it? I am looking for a best practice in this scenario (any streaming examples?) with the following questions: 1. Do you serialize/deserialize to/from HDFS with the key as the file name and the state as the content? 2. Do you serialize/deserialize to/from Cassandra with key, content? > Spark Streaming State Recovery limitations > -- > > Key: SPARK-12137 > URL: https://issues.apache.org/jira/browse/SPARK-12137 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.4.1 >Reporter: Ravindar >Priority: Critical > > There were multiple threads in forums asking a similar question without a clear > answer, hence entering it here. > We have a streaming application that goes through multi-step processing. In > some of these steps, stateful operations like *updateStateByKey* are used to > maintain an accumulated running state (and other state info) with incoming > RDD streams. As the streaming application is incremental, it is imperative that > we recover/restore from the previous known state in the following two scenarios: > 1. On Spark driver/streaming application failure. > In this scenario the driver/streaming application is shut down and > restarted. The recommended approach is to enable *checkpoint(checkpointDir)* > and use *StreamingContext.getOrCreate* to restore the context from checkpoint > state. > 2. Upgrade of the driver/streaming application with additional steps in the > processing. > In this scenario, we introduced new steps with downstream processing for > new functionality, without changes to existing steps. Upgrading the streaming > application with the new code fails on *StreamingContext.getOrCreate*, as there is a > mismatch with the saved checkpoint. > Both of the above scenarios need a unified approach where accumulated state > can be saved and restored. The first approach of restoring from checkpoint > works for driver failure but not for a code upgrade. When the application code > changes, the recommendation is to delete checkpoint data when new code is > deployed. If so, how do you reconstitute all of the stateful (e.g. > updateStateByKey) information from the last run? Every streaming application > has to save up-to-date state for each session, represented by a key, and then > initialize from it when a new session starts for the same key. Does > every application have to create its own mechanism, given this is very > similar to the current state checkpointing to HDFS? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
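The save-and-restore pattern the thread is asking about can be modeled without Spark at all: keep the per-key accumulated state, snapshot it to external storage (HDFS, Cassandra, ...) each batch, and feed the snapshot back in as the initial state after an upgrade. A minimal pure-Python sketch of that pattern (all function names are illustrative, not Spark API; in Spark you would write the snapshot from *foreachRDD* and pass the restored state via the initial-RDD variant of *updateStateByKey*, if your version provides one):

```python
import json

def update_state(new_values, running_count):
    # Models an updateStateByKey update function: fold the batch's new
    # values for a key into that key's accumulated state.
    return (running_count or 0) + sum(new_values)

def apply_batch(state, batch):
    # One micro-batch: update each key's state from its incoming values.
    for key, value in batch:
        state[key] = update_state([value], state.get(key))
    return state

def snapshot(state):
    # Serialize the full key -> state map to external storage.
    return json.dumps(state)

def restore(blob):
    # On restart after an upgrade, rebuild state from the last snapshot;
    # with no snapshot, start from empty (the default-value case).
    return json.loads(blob) if blob else {}

state = restore(None)
state = apply_batch(state, [("a", 3), ("b", 1), ("a", 2)])
blob = snapshot(state)    # saved externally at the end of each batch
state2 = restore(blob)    # restored by the upgraded application
state2 = apply_batch(state2, [("a", 1)])
```

Because the snapshot format is owned by the application rather than by Spark's checkpoint serialization, it survives code changes that would invalidate a checkpoint.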
[jira] [Commented] (SPARK-12305) Add Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054201#comment-15054201 ] Sean Owen commented on SPARK-12305: --- Same here, [~proflin]: please don't open nearly blank JIRAs like this. Wait until you've written up a clear description and then open it. I'm going to close these otherwise. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > Add Receiver scheduling info onto Spark Streaming web UI > > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054199#comment-15054199 ] Sean Owen commented on SPARK-12306: --- [~proflin] there's no detail here. It doesn't sound like something that should be optionally ignored. I'd have to close this unless you can make a case for this, or at least describe it. > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12302) Example for servlet filter used by spark.ui.filters
[ https://issues.apache.org/jira/browse/SPARK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12302: Assignee: Apache Spark > Example for servlet filter used by spark.ui.filters > --- > > Key: SPARK-12302 > URL: https://issues.apache.org/jira/browse/SPARK-12302 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 1.5.2 >Reporter: Kai Sasaki >Assignee: Apache Spark >Priority: Trivial > Labels: examples, security > > Although the {{spark.ui.filters}} configuration uses a simple servlet filter, it is > often difficult to understand how to write the filter code and how to integrate it > with actual Spark applications. > It would help to provide examples for trying out a secure Spark cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
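For context, wiring a filter into the UI is mostly configuration. A sketch of what such an example might configure (the filter class {{com.example.BasicAuthFilter}} is hypothetical, and the parameter-passing form should be verified against the Spark configuration docs for your version):

```
# spark-defaults.conf (illustrative sketch)
spark.ui.filters                          com.example.BasicAuthFilter
# Filter parameters are read from spark.<filter class>.params
spark.com.example.BasicAuthFilter.params  username=admin,password=secret
```

The filter itself would be a standard javax.servlet.Filter implementation placed on the driver's classpath; the missing piece this issue asks for is a worked example of that filter code.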
[jira] [Assigned] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12304: Assignee: Apache Spark > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Assignee: Apache Spark >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. > Before: > !before-5.png! > After: > !after-5.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
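The maxY problem is easy to see with numbers. A tiny illustrative Python sketch (the rates are made up): when the per-receiver graphs inherit the y-axis ceiling of the aggregate input-rate graph, a slow receiver's line is squashed near zero; the proposal derives a separate ceiling from the per-receiver series themselves:

```python
# Made-up per-receiver ingestion rates (events/sec) over three ticks.
per_receiver_rates = {
    "receiver-0": [950, 1000, 980],
    "receiver-1": [8, 12, 10],
}

# Today: every graph, including each per-receiver graph, shares the
# ceiling of the aggregate input-rate graph (sum across receivers),
# so receiver-1's line barely rises above the x-axis.
ticks = list(zip(*per_receiver_rates.values()))
input_rate_max_y = max(sum(t) for t in ticks)

# Proposed: a separate ceiling shared only among the per-receiver
# graphs, computed from the per-receiver series themselves.
per_receiver_max_y = max(max(s) for s in per_receiver_rates.values())
```

With tens of receivers the gap between the two ceilings grows with the receiver count, which is exactly the "line almost hits the ground" effect described above.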
[jira] [Assigned] (SPARK-12302) Example for servlet filter used by spark.ui.filters
[ https://issues.apache.org/jira/browse/SPARK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12302: Assignee: (was: Apache Spark) > Example for servlet filter used by spark.ui.filters > --- > > Key: SPARK-12302 > URL: https://issues.apache.org/jira/browse/SPARK-12302 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 1.5.2 >Reporter: Kai Sasaki >Priority: Trivial > Labels: examples, security > > Although the {{spark.ui.filters}} configuration uses a simple servlet filter, it is > often difficult to understand how to write the filter code and how to integrate it > with actual Spark applications. > It would help to provide examples for trying out a secure Spark cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12304: Assignee: (was: Apache Spark) > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. > Before: > !before-5.png! > After: > !after-5.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Add Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Shepherd: Shixiong Zhu > Add Receiver scheduling info onto Spark Streaming web UI > > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054184#comment-15054184 ] Liwei Lin edited comment on SPARK-12306 at 12/12/15 9:51 AM: - This issue is reported by me, and I'm working on it. :-) was (Author: proflin): This is reported by me, and I'm working on it. :-) > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054184#comment-15054184 ] Liwei Lin commented on SPARK-12306: --- This is reported by me, and I'm working on it. :-) > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12305) Add Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054185#comment-15054185 ] Liwei Lin commented on SPARK-12305: --- This issue is reported by me, and I'm working on it. :-) > Add Receiver scheduling info onto Spark Streaming web UI > > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12306: -- Shepherd: Shixiong Zhu > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12306: -- Component/s: Streaming > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12306: -- Affects Version/s: 1.5.2 > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.5.2 >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12306) Add an option to ignore BlockRDD partition data loss
[ https://issues.apache.org/jira/browse/SPARK-12306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12306: -- Summary: Add an option to ignore BlockRDD partition data loss (was: ToEdit) > Add an option to ignore BlockRDD partition data loss > > > Key: SPARK-12306 > URL: https://issues.apache.org/jira/browse/SPARK-12306 > Project: Spark > Issue Type: Improvement >Reporter: Liwei Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Add Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Summary: Add Receiver scheduling info onto Spark Streaming web UI (was: Adds Receiver scheduling info onto Spark Streaming web UI) > Add Receiver scheduling info onto Spark Streaming web UI > > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Adds Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Summary: Adds Receiver scheduling info onto Spark Streaming web UI (was: Adds Receiver scheduling info onto ) > Adds Receiver scheduling info onto Spark Streaming web UI > - > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Adds Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Component/s: Streaming > Adds Receiver scheduling info onto Spark Streaming web UI > - > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Adds Receiver scheduling info onto Spark Streaming web UI
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Affects Version/s: 1.5.2 > Adds Receiver scheduling info onto Spark Streaming web UI > - > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12305) Adds Receiver scheduling info onto
[ https://issues.apache.org/jira/browse/SPARK-12305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12305: -- Priority: Minor (was: Critical) > Adds Receiver scheduling info onto > --- > > Key: SPARK-12305 > URL: https://issues.apache.org/jira/browse/SPARK-12305 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12305) Adds Receiver scheduling info onto
Liwei Lin created SPARK-12305: - Summary: Adds Receiver scheduling info onto Key: SPARK-12305 URL: https://issues.apache.org/jira/browse/SPARK-12305 Project: Spark Issue Type: Improvement Reporter: Liwei Lin Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12306) ToEdit
Liwei Lin created SPARK-12306: - Summary: ToEdit Key: SPARK-12306 URL: https://issues.apache.org/jira/browse/SPARK-12306 Project: Spark Issue Type: Improvement Reporter: Liwei Lin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12304: -- Description: Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. This may lead to somewhat unfriendly graphs: once we have tens of Receivers or more, every 'Per-Receiver Times' line almost hits the ground. This issue proposes to calculate a new maxY against the original one, which is shared among all the 'Per-Receiver Times & Histograms' graphs. Before: !before-5.png! After: !after-5.png! was: Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. This may lead to somewhat unfriendly graphs: once we have tens of Receivers or more, every 'Per-Receiver Times' line almost hits the ground. This issue proposes to calculate a new maxY against the original one, which is shared among all the 'Per-Receiver Times & Histograms' graphs. > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. > Before: > !before-5.png! > After: > !after-5.png!
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12304: -- Description: Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. This may lead to somewhat unfriendly graphs: once we have tens of Receivers or more, every 'Per-Receiver Times' line almost hits the ground. This issue proposes to calculate a new maxY against the original one, which is shared among all the 'Per-Receiver Times & Histograms' graphs. was: Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. This may lead to somewhat unfriendly graphs: once we have tens of Receivers or more, every 'Per-Receiver Times' line almost hits the ground. This issue proposes to calculate a new maxY against the original one, which is shared among all the 'Per-Receiver Times & Histograms' graphs. > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12304: -- Attachment: after-5.png before-5.png > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs
[ https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-12304: -- Summary: Make Spark Streaming web UI display more friendly Receiver graphs (was: Make Spark Streaming web UI display more friendly Receiver graph) > Make Spark Streaming web UI display more friendly Receiver graphs > - > > Key: SPARK-12304 > URL: https://issues.apache.org/jira/browse/SPARK-12304 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Liwei Lin >Priority: Minor > Attachments: after-5.png, before-5.png > > > Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input > Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. > This may lead to somewhat unfriendly graphs: once we have tens of Receivers > or more, every 'Per-Receiver Times' line almost hits the ground. > This issue proposes to calculate a new maxY against the original one, which > is shared among all the 'Per-Receiver Times & Histograms' graphs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graph
Liwei Lin created SPARK-12304: - Summary: Make Spark Streaming web UI display more friendly Receiver graph Key: SPARK-12304 URL: https://issues.apache.org/jira/browse/SPARK-12304 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.5.2 Reporter: Liwei Lin Priority: Minor Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. This may lead to somewhat unfriendly graphs: once we have tens of Receivers or more, every 'Per-Receiver Times' line almost hits the ground. This issue proposes to calculate a new maxY against the original one, which is shared among all the 'Per-Receiver Times & Histograms' graphs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver
[ https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11193. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10203 [https://github.com/apache/spark/pull/10203] > Spark 1.5+ Kinesis Streaming - ClassCastException when starting > KinesisReceiver > --- > > Key: SPARK-11193 > URL: https://issues.apache.org/jira/browse/SPARK-11193 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.0, 1.5.1 >Reporter: Phil Kallos > Fix For: 2.0.0, 1.6.1 > > Attachments: screen.png > > > After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis > Spark Streaming application, and am being consistently greeted with this > exception: > java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast > to scala.collection.mutable.SynchronizedMap > at > org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) > at > org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532) > at > org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) > at > org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Worth noting that I am able to reproduce this issue locally, and also on > Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0). > Also, I am not able to run the included kinesis-asl example. > Built locally using: > git checkout v1.5.1 > mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package > Example run command: > bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector > https://kinesis.us-east-1.amazonaws.com -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2870. -- Resolution: Not A Problem > Thorough schema inference directly on RDDs of Python dictionaries > - > > Key: SPARK-2870 > URL: https://issues.apache.org/jira/browse/SPARK-2870 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Reporter: Nicholas Chammas > > h4. Background > I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. > They process JSON text directly and infer a schema that covers the entire > source data set. > This is very important with semi-structured data like JSON since individual > elements in the data set are free to have different structures. Matching > fields across elements may even have different value types. > For example: > {code} > {"a": 5} > {"a": "cow"} > {code} > To get a queryable schema that covers the whole data set, you need to infer a > schema by looking at the whole data set. The aforementioned > {{SQLContext.json...()}} methods do this very well. > h4. Feature Request > What we need is for {{SQlContext.inferSchema()}} to do this, too. > Alternatively, we need a new {{SQLContext}} method that works on RDDs of > Python dictionaries and does something functionally equivalent to this: > {code} > SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x))) > {code} > As of 1.0.2, > [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] > just looks at the first element in the data set. This won't help much when > the structure of the elements in the target RDD is variable. > h4. Example Use Case > * You have some JSON text data that you want to analyze using Spark SQL. > * You would use one of the {{SQLContext.json...()}} methods, but you need to > do some filtering on the data first to remove bad elements--basically, some > minimal schema validation. 
> * You deserialize the JSON objects to Python {{dict}}s and filter out the > bad ones. You now have an RDD of dictionaries. > * From this RDD, you want a SchemaRDD that captures the schema for the whole > data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
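The whole-data-set inference the request asks for can be sketched in plain Python. This is illustrative only, not Spark's implementation; the rule of widening conflicting field types to string is an assumption modeled on how the JSON methods behave:

```python
def infer_whole_schema(records):
    """Infer a field -> type-name mapping by scanning *every* record,
    widening conflicting value types to 'string' (assumed JSON-style rule)."""
    schema = {}
    for rec in records:
        for field, value in rec.items():
            t = type(value).__name__
            if field not in schema:
                schema[field] = t
            elif schema[field] != t:
                # Same field seen with two different types: widen to string.
                schema[field] = "string"
    return schema

records = [{"a": 5}, {"a": "cow"}, {"b": 1.5}]
print(infer_whole_schema(records))  # -> {'a': 'string', 'b': 'float'}
```

Unlike a first-element-only {{inferSchema()}}, this sees every record, so field {{b}} and the {{int}}/{{str}} conflict on {{a}} both survive into the schema.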
[jira] [Updated] (SPARK-12303) Configuration parameter to choose whether the REPL-generated class directory name is random or fixed
[ https://issues.apache.org/jira/browse/SPARK-12303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12303: -- Priority: Minor (was: Major) Issue Type: New Feature (was: Wish) What would be the purpose of this? > Configuration parameter to choose whether the REPL-generated > class directory name is random or fixed > --- > > Key: SPARK-12303 > URL: https://issues.apache.org/jira/browse/SPARK-12303 > Project: Spark > Issue Type: New Feature > Components: Spark Shell >Reporter: piyush >Priority: Minor > > {{.class}} files generated by the Spark REPL are stored in a temp directory with a random > name. > This requests a configuration parameter to choose whether the REPL-generated > class directory name is random or fixed.
[jira] [Commented] (SPARK-12218) Boolean logic in SQL does not work: "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054142#comment-15054142 ] Xiao Li commented on SPARK-12218: - Found the fix, but it is not merged into 1.5.2. I will open a PR tomorrow so the fix can be merged for 1.5.3. Thanks! > Boolean logic in SQL does not work: "not (A and B)" is not the same as "(not > A) or (not B)" > > > Key: SPARK-12218 > URL: https://issues.apache.org/jira/browse/SPARK-12218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Irakli Machabeli >Priority: Blocker > > Two logically equivalent queries produce different results: > In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( > PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff'))").count() > Out[2]: 18 > In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( > not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff')))").count() > Out[3]: 28
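The invariant the two queries must satisfy is De Morgan's law. A small pure-Python check (the rows below are hypothetical; only the column names come from the report) shows the two predicates necessarily select the same row set, which is what Spark SQL 1.5.2 violated:

```python
# Hypothetical rows using the report's column names.
rows = [
    {"PaymentsReceived": 0,   "ExplicitRoll": "PreviouslyPaidOff"},
    {"PaymentsReceived": 0,   "ExplicitRoll": "Current"},
    {"PaymentsReceived": 100, "ExplicitRoll": "PreviouslyChargedOff"},
    {"PaymentsReceived": 100, "ExplicitRoll": "Current"},
]
bad_rolls = {"PreviouslyPaidOff", "PreviouslyChargedOff"}

# Query [2]: not (A and B)
pred1 = [r for r in rows
         if not (r["PaymentsReceived"] == 0 and r["ExplicitRoll"] in bad_rolls)]

# Query [3]: (not A) or (not B)
pred2 = [r for r in rows
         if (not r["PaymentsReceived"] == 0) or (r["ExplicitRoll"] not in bad_rolls)]

assert pred1 == pred2  # De Morgan's law: identical row sets
print(len(pred1))      # -> 3 (only the 0/'PreviouslyPaidOff' row is excluded)
```

Any engine returning different counts for these two forms, as the 18 vs. 28 above shows, has a bug in its boolean-expression handling.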
[jira] [Assigned] (SPARK-12300) Fix schema inference on local collections
[ https://issues.apache.org/jira/browse/SPARK-12300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12300: Assignee: (was: Apache Spark) > Fix schema inference on local collections > - > > Key: SPARK-12300 > URL: https://issues.apache.org/jira/browse/SPARK-12300 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: holdenk >Priority: Minor > > Current schema inference for local Python collections halts as soon as there > are no NullTypes. This is different from when we specify a sampling ratio of > 1.0 on a distributed collection. This could result in incomplete schema > information. > Repro: > {code} > input = [{"a": 1}, {"b": "coffee"}] > df = sqlContext.createDataFrame(input) > print df.schema > {code} > Discovered while looking at SPARK-2870
[jira] [Assigned] (SPARK-12300) Fix schema inference on local collections
[ https://issues.apache.org/jira/browse/SPARK-12300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12300: Assignee: Apache Spark > Fix schema inference on local collections > - > > Key: SPARK-12300 > URL: https://issues.apache.org/jira/browse/SPARK-12300 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > Current schema inference for local Python collections halts as soon as there > are no NullTypes. This is different from when we specify a sampling ratio of > 1.0 on a distributed collection. This could result in incomplete schema > information. > Repro: > {code} > input = [{"a": 1}, {"b": "coffee"}] > df = sqlContext.createDataFrame(input) > print df.schema > {code} > Discovered while looking at SPARK-2870
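A plain-Python sketch (illustrative only, not Spark's actual inference code) of why the repro above prints an incomplete schema: inferring from only the first element drops field {{b}} entirely, whereas a full scan, analogous to {{samplingRatio=1.0}} on a distributed collection, keeps it:

```python
input_rows = [{"a": 1}, {"b": "coffee"}]

# First-element-only inference (the behavior the bug describes):
# the first row has no NullTypes, so scanning stops and "b" is never seen.
first_only = {k: type(v).__name__ for k, v in input_rows[0].items()}

# Full-scan inference: every row contributes its fields to the schema.
full_scan = {}
for row in input_rows:
    for k, v in row.items():
        full_scan.setdefault(k, type(v).__name__)

print(first_only)  # -> {'a': 'int'}
print(full_scan)   # -> {'a': 'int', 'b': 'str'}
```

The fix tracked here is to make local-collection inference behave like the full-scan case so {{df.schema}} in the repro includes both fields.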