[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2015-12-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075831#comment-15075831
 ] 

Sean Owen commented on SPARK-12537:
---

(How about soliciting more opinions here to see where others' instincts lie 
about this issue?)

As it happens, libraries seem to agree on this point, including Spark's 
current behavior. You'd have to have a somewhat compelling reason to change 
Spark's behavior and to not match other implementations. At best the change is 
neutral: there are upsides and downsides to enabling this behavior, and either 
default has a potential problem and a benefit. Hence, to me it's clear that 
preserving the de facto standard behavior makes sense.

I'm surprised you're suggesting otherwise, so I wonder if I am missing context. 
Is there an additional compelling reason to change the behavior that outweighs 
the problem this could present?

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provide an option to choose whether the JSON parser accepts backslash 
> quoting of any character or not.
> For example, if a JSON file includes an escape sequence that is not listed in the 
> JSON backslash quoting specification, the record is returned as a corrupt_record
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> +--------------------+---------+-----+
> |     _corrupt_record|     name|price|
> +--------------------+---------+-----+
> |                null|Cazen Lee|  $10|
> |{"name": "John Do...|     null| null|
> |                null|    Tracy|  $10|
> +--------------------+---------+-----+
> {code}
> After applying this patch, we can enable the allowBackslashEscapingAnyCharacter 
> option as shown below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +---------+-----+
> |     name|price|
> +---------+-----+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |    Tracy|  $10|
> +---------+-----+
> {code}
> This issue is similar to HIVE-11825 and HIVE-12717.
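
For reference, a minimal sketch of how such an option maps onto the underlying 
JSON parser, assuming the data source delegates to Jackson (the feature name 
below is Jackson's, not a Spark setting):

{code}
// Hedged sketch: enable Jackson's "escape any character" parser feature.
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}

val factory = new JsonFactory()
factory.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true)
// With the feature on, a sequence like "\$" is read as "$" instead of raising
// a parse error, which is what allowBackslashEscapingAnyCharacter would expose.
val parser = factory.createParser("""{"name": "John Doe", "price": "\$20"}""")
{code}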



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7995) Remove AkkaRpcEnv and remove Akka from the dependencies of Core

2015-12-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7995.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Remove AkkaRpcEnv and remove Akka from the dependencies of Core
> ---
>
> Key: SPARK-7995
> URL: https://issues.apache.org/jira/browse/SPARK-7995
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6280) Remove Akka systemName from Spark

2015-12-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6280.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Remove Akka systemName from Spark
> -
>
> Key: SPARK-6280
> URL: https://issues.apache.org/jira/browse/SPARK-6280
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> `systemName` is an Akka concept. An RPC implementation does not need to support 
> it. 
> We can hard-code the system name in Spark and hide it in the internal Akka 
> RPC implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4086) Fold-style aggregation for VertexRDD

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4086.
--
Resolution: Won't Fix

> Fold-style aggregation for VertexRDD
> 
>
> Key: SPARK-4086
> URL: https://issues.apache.org/jira/browse/SPARK-4086
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>
> VertexRDD currently supports creation and joins only through a reduce-style 
> interface where the user specifies how to merge two conflicting values. We 
> should also support a fold-style interface that takes a default value 
> (possibly of a different type than the input collection) and a fold function 
> specifying how to accumulate values.
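
A hypothetical signature sketch of what such a fold-style interface could look 
like (names and shapes are illustrative only, not a committed API):

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

// Hypothetical sketch only: build/join with a default accumulator value, whose
// type VD may differ from the input value type A, plus a fold function that
// says how to accumulate each incoming value into the accumulator.
trait FoldStyleVertexRDDOps {
  def aggregateByVertex[A, VD: ClassTag](
      values: RDD[(VertexId, A)],   // possibly several values per vertex id
      default: VD                   // starting accumulator for every vertex
  )(foldFunc: (VD, A) => VD): VertexRDD[VD]
}
{code}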



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4961) Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4961.
--
Resolution: Won't Fix

> Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted 
> processing time
> ---
>
> Key: SPARK-4961
> URL: https://issues.apache.org/jira/browse/SPARK-4961
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: YanTang Zhai
>Priority: Minor
>
> HadoopRDD.getPartitions is computed lazily while DAGScheduler processes the 
> JobSubmitted event. If the input directory is large, getPartitions may take a 
> long time; for example, in our cluster it takes anywhere from 0.029s to 
> 766.699s. While one JobSubmitted event is being processed, the others must wait. 
> Thus, we want to move HadoopRDD.getPartitions forward to reduce 
> DAGScheduler.JobSubmitted processing time, so that other JobSubmitted events 
> don't need to wait as long. The HadoopRDD object could compute its partitions 
> when it is instantiated.
> We can analyse and compare the execution time before and after the optimization.
> TaskScheduler.start execution time: [time1__]
> DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or 
> TaskScheduler.start) execution time: [time2_]
> HadoopRDD.getPartitions execution time: [time3___]
> Stages execution time: [time4_]
> (1) The app has only one job
> (a)
> The execution time of the job before optimization is 
> [time1__][time2_][time3___][time4_].
> The execution time of the job after optimization is 
> [time1__][time3___][time2_][time4_].
> In summary, if the app has only one job, the total execution time is the same 
> before and after the optimization.
> (2) The app has 4 jobs
> (a) Before optimization,
> job1 execution time is [time2_][time3___][time4_],
> job2 execution time is [time2__][time3___][time4_],
> job3 execution time is [time2][time3___][time4_],
> job4 execution time is [time2_][time3___][time4_].
> After optimization, 
> job1 execution time is [time3___][time2_][time4_],
> job2 execution time is [time3___][time2__][time4_],
> job3 execution time is [time3___][time2_][time4_],
> job4 execution time is [time3___][time2__][time4_].
> In summary, if the app has multiple jobs, the average execution time after the 
> optimization is less than before.
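
For illustration only, a user-level sketch of the same idea, forcing partition 
computation before a job is submitted (the proposed change would do this inside 
HadoopRDD itself; the input path is hypothetical):

{code}
// Illustrative sketch, not the proposed patch: compute HadoopRDD partitions
// eagerly so DAGScheduler.JobSubmitted handling does not pay for it later.
val rdd = sc.textFile("hdfs:///large/input/dir")   // hypothetical path
rdd.partitions                                     // triggers getPartitions now
rdd.map(_.length).count()                          // JobSubmitted is now quicker to process
{code}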



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12438) Add SQLUserDefinedType support for encoder

2015-12-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075850#comment-15075850
 ] 

Apache Spark commented on SPARK-12438:
--

User 'thomastechs' has created a pull request for this issue:
https://github.com/apache/spark/pull/10529

> Add SQLUserDefinedType support for encoder
> --
>
> Key: SPARK-12438
> URL: https://issues.apache.org/jira/browse/SPARK-12438
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> We should add SQLUserDefinedType support for encoder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1061) allow Hadoop RDDs to be read w/ a partitioner

2015-12-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075854#comment-15075854
 ] 

Sean Owen commented on SPARK-1061:
--

Is this still live?

> allow Hadoop RDDs to be read w/ a partitioner
> -
>
> Key: SPARK-1061
> URL: https://issues.apache.org/jira/browse/SPARK-1061
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>
> Using partitioners to get narrow dependencies can save tons of time on a 
> shuffle.  However, after saving an RDD to hdfs, and then reloading it, all 
> partitioner information is lost.  This means that you can never get a narrow 
> dependency when loading data from hadoop.
> I think we could get around this by:
> 1) having a modified version of the hadoop rdd that keeps track of the original 
> part file (or maybe just prevents splits altogether ...)
> 2) adding an "assumePartition(partitioner: Partitioner, verify: Boolean)" function 
> to RDD.  It would create a new RDD, which had the exact same data but just 
> pretended that the RDD had the given partitioner applied to it.  And if 
> verify=true, it could add a mapPartitionsWithIndex to check that each record 
> was in the right partition.
> http://apache-spark-user-list.1001560.n3.nabble.com/setting-partitioners-with-hadoop-rdds-td976.html
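
A sketch of the verification half of idea (2), with hypothetical names (this is 
not an existing Spark API):

{code}
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical helper, illustrative only: check that every key of a reloaded
// RDD already sits in the partition the given partitioner would assign it to.
// A real "assumePartition" would additionally report `part` via RDD#partitioner.
def verifyPartitioning[K, V](rdd: RDD[(K, V)], part: Partitioner): RDD[(K, V)] =
  rdd.mapPartitionsWithIndex({ (idx, iter) =>
    iter.map { case kv @ (k, _) =>
      require(part.getPartition(k) == idx, s"key $k is not in partition $idx")
      kv
    }
  }, preservesPartitioning = true)
{code}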



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12039) HiveSparkSubmitSuite's SPARK-9757 Persist Parquet relation with decimal column is very flaky

2015-12-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12039:

Assignee: Yin Huai

> HiveSparkSubmitSuite's SPARK-9757 Persist Parquet relation with decimal 
> column is very flaky
> 
>
> Key: SPARK-12039
> URL: https://issues.apache.org/jira/browse/SPARK-12039
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/4121/consoleFull
> It frequently fails to download 
> `commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar` in 
> hadoop 1 tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12039) HiveSparkSubmitSuite's SPARK-9757 Persist Parquet relation with decimal column is very flaky

2015-12-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12039.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> HiveSparkSubmitSuite's SPARK-9757 Persist Parquet relation with decimal 
> column is very flaky
> 
>
> Key: SPARK-12039
> URL: https://issues.apache.org/jira/browse/SPARK-12039
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
> Fix For: 2.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/4121/consoleFull
> It frequently fails to download 
> `commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar` in 
> hadoop 1 tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5832.
--
  Resolution: Won't Fix
Target Version/s:   (was: )

> Add Affinity Propagation clustering algorithm
> -
>
> Key: SPARK-5832
> URL: https://issues.apache.org/jira/browse/SPARK-5832
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>  Labels: clustering
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6105) enhance spark-ganglia to support redundant gmond addresses setting in ganglia unicast mode

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6105.
--
Resolution: Won't Fix

> enhance spark-ganglia to support redundant gmond addresses setting in ganglia 
> unicast mode
> --
>
> Key: SPARK-6105
> URL: https://issues.apache.org/jira/browse/SPARK-6105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Fuqing Yang
>Priority: Minor
>
> 1 why do it?
> When using spark-ganglia with ganglia in unicast mode, I found it has a single 
> point of failure in the gmond setting.
> We can not set *.sink.ganglia.host to localhost but only to a designated 
> gmond address, because gmond does not forward the metrics from the Spark 
> GangliaSink. This problem can be found in the ganglia mailing list:
> [Ganglia-general] Gmond forward messages
> http://sourceforge.net/p/ganglia/mailman/message/30268666/
> [Ganglia-general] gmond forwarding
> http://sourceforge.net/p/ganglia/mailman/message/28290307/
> So, in ganglia unicast mode, we may miss the metrics if the designated 
> gmond goes down.
> 2 how to resolve this SPOF
> Add a new parameter *.sink.ganglia.servers in conf/metrics.properties.
> eg:
> *.sink.ganglia.servers=host1:port1, host2:port2
> If the spark ganglia sink is enabled, GangliaSink parses 
> *.sink.ganglia.servers to construct redundant GangliaReporters.
> For backward compatibility, preserve the old parameters *.sink.ganglia.host and 
> *.sink.ganglia.port; if users do not use *.sink.ganglia.servers, give a 
> warning and the old parameters still work.
> 3 problems introduced
> none
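
A minimal sketch of the parsing side, assuming the proposed property keeps the 
"host1:port1, host2:port2" format shown above (illustrative code, not the 
actual GangliaSink):

{code}
// Illustrative only: split the proposed *.sink.ganglia.servers value into
// (host, port) pairs, one redundant GangliaReporter per pair.
def parseServers(value: String): Seq[(String, Int)] =
  value.split(",").map(_.trim).filter(_.nonEmpty).map { hostPort =>
    val Array(host, port) = hostPort.split(":")
    (host.trim, port.trim.toInt)
  }

// parseServers("host1:8649, host2:8649") == Seq(("host1", 8649), ("host2", 8649))
{code}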



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7441) Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7441.
--
Resolution: Won't Fix

> Implement microbatch functionality so that Spark Streaming can process a 
> large backlog of existing files discovered in batch in smaller batches
> ---
>
> Key: SPARK-7441
> URL: https://issues.apache.org/jira/browse/SPARK-7441
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Emre Sevinç
>  Labels: performance
>
> Implement microbatch functionality so that Spark Streaming can process a huge 
> backlog of existing files discovered in batch in smaller batches.
> Spark Streaming can process already existing files in a directory, and 
> depending on the value of "{{spark.streaming.minRememberDuration}}" (60 
> seconds by default, see SPARK-3276 for more details), this might mean that a 
> Spark Streaming application can receive thousands, or hundreds of thousands, 
> of files within the first batch interval. This, in turn, leads to something 
> like a 'flooding' effect for the streaming application, which tries to deal 
> with a huge number of existing files in a single batch interval.
> We propose a very simple change to 
> {{org.apache.spark.streaming.dstream.FileInputDStream}}, so that, based on a 
> configuration property such as "{{spark.streaming.microbatch.size}}", it will 
> either keep its default behavior when {{spark.streaming.microbatch.size}} 
> has its default value of {{0}} (meaning process as many new files as have been 
> discovered in the current batch interval), or process new files in 
> groups of {{spark.streaming.microbatch.size}} (e.g. in groups of 100).
> We have tested this patch at one of our customers, and it has been running 
> successfully for weeks (e.g. there were cases where our Spark Streaming 
> application was stopped, tens of thousands of files were created in a 
> directory in the meantime, and our Spark Streaming application had to process 
> those existing files after it was started).
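
A conceptual sketch of the grouping step, assuming the proposed 
{{spark.streaming.microbatch.size}} property (this is not the actual 
FileInputDStream change):

{code}
// Conceptual sketch only: chunk newly discovered files into micro-batches of
// at most `microbatchSize` files; 0 keeps the default "all at once" behavior.
def chunkNewFiles(newFiles: Seq[String], microbatchSize: Int): Seq[Seq[String]] =
  if (microbatchSize <= 0) Seq(newFiles)
  else newFiles.grouped(microbatchSize).toSeq

// chunkNewFiles(Seq("a", "b", "c"), 2) == Seq(Seq("a", "b"), Seq("c"))
{code}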



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6157) Unrolling with MEMORY_AND_DISK should always release memory

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6157.
--
Resolution: Won't Fix

> Unrolling with MEMORY_AND_DISK should always release memory
> ---
>
> Key: SPARK-6157
> URL: https://issues.apache.org/jira/browse/SPARK-6157
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.2.1
>Reporter: SuYan
>Assignee: SuYan
>
> === EDIT by andrewor14 ===
> The existing description was somewhat confusing, so here's a more succinct 
> version of it.
> If unrolling a block with MEMORY_AND_DISK was unsuccessful, we will drop the 
> block to disk
> directly. After doing so, however, we don't need the underlying array that 
> held the partial
> values anymore, so we should release the pending unroll memory for other 
> tasks on the same
> executor. Otherwise, other tasks may unnecessarily drop their blocks to disk 
> due to the lack
> of unroll space, resulting in worse performance.
> === Original comment ===
> Current code:
> Suppose we want to cache a MEMORY_AND_DISK level block:
> 1. We try to put it in memory and unrolling is unsuccessful; the unroll memory 
> stays reserved because we got an iterator from the partially unrolled array.
> 2. Then we put the block on disk.
> 3. We get the value from get(blockId) and an iterator from that value, and the 
> unrolled array is no longer needed. So here we should release the reserved 
> unroll memory, instead of only releasing it when the task ends.
> Also, somebody has already opened a pull request so that, when getting a 
> MEMORY_AND_DISK level block and caching it in memory from disk, we use 
> file.length to check whether it fits in the memory store instead of just 
> allocating a file.length buffer, which may lead to OOM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6332) compute calibration curve for binary classifiers

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6332.
--
Resolution: Won't Fix

> compute calibration curve for binary classifiers
> 
>
> Key: SPARK-6332
> URL: https://issues.apache.org/jira/browse/SPARK-6332
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Robert Dodier
>Priority: Minor
>  Labels: classification
>
> For binary classifiers, calibration measures how classifier scores compare to 
> the proportion of positive examples. If the classifier is well calibrated, 
> the classifier score is approximately equal to the proportion of positive 
> examples. This is important if the scores are used as probabilities for 
> making decisions via expected cost. Otherwise, the calibration curve may 
> still be interesting; the proportion of positive examples should at least be 
> a monotonic function of the score.
> I propose that a new method for calibration be added to the class 
> BinaryClassificationMetrics, since calibration seems to fit in with the ROC 
> curve and other classifier assessments. 
> For more about calibration, see: 
> http://en.wikipedia.org/wiki/Calibration_%28statistics%29#In_classification
> References:
> Mahdi Pakdaman Naeini, Gregory F. Cooper, Milos Hauskrecht. "Binary 
> Classifier Calibration: Non-parametric approach." 
> http://arxiv.org/abs/1401.3390
> Alexandru Niculescu-Mizil, Rich Caruana. "Predicting Good Probabilities With 
> Supervised Learning." Appearing in Proceedings of the 22nd International 
> Conference on Machine Learning, Bonn, Germany, 2005. 
> http://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
> "Properties and benefits of calibrated classifiers." Ira Cohen, Moises 
> Goldszmidt. http://www.hpl.hp.com/techreports/2004/HPL-2004-22R1.pdf
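
As a rough illustration of the computation being proposed, an equal-width 
binning sketch over (score, label) pairs (this is not part of 
BinaryClassificationMetrics; names and binning strategy are illustrative):

{code}
import org.apache.spark.rdd.RDD

// Rough sketch only: bin scores into `numBins` equal-width bins in [0, 1] and
// return (bin midpoint, empirical fraction of positives) for each non-empty bin.
def calibrationCurve(scoreAndLabels: RDD[(Double, Double)],
                     numBins: Int = 10): Array[(Double, Double)] = {
  scoreAndLabels
    .map { case (score, label) =>
      val bin = math.min((score * numBins).toInt, numBins - 1)
      (bin, (label, 1L))
    }
    .reduceByKey { case ((pos1, n1), (pos2, n2)) => (pos1 + pos2, n1 + n2) }
    .map { case (bin, (positives, total)) =>
      ((bin + 0.5) / numBins, positives / total.toDouble)
    }
    .collect()
    .sortBy(_._1)
}
{code}

For a well-calibrated classifier the second coordinate should track the first.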



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2426.
--
Resolution: Won't Fix

> Quadratic Minimization for MLlib ALS
> 
>
> Key: SPARK-2426
> URL: https://issues.apache.org/jira/browse/SPARK-2426
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Debasish Das
>Assignee: Debasish Das
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Current ALS supports least squares and nonnegative least squares.
> I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
> the following ALS problems:
> 1. ALS with bounds
> 2. ALS with L1 regularization
> 3. ALS with Equality constraint and bounds
> Initial runtime comparisons are presented at Spark Summit. 
> http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
> Based on Xiangrui's feedback I am currently comparing the ADMM based 
> Quadratic Minimization solvers with IPM based QpSolvers and the default 
> ALS/NNLS. I will keep updating the runtime comparison results.
> For integration the detailed plan is as follows:
> 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
> 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6157) Unrolling with MEMORY_AND_DISK should always release memory

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-6157.

Assignee: (was: SuYan)

> Unrolling with MEMORY_AND_DISK should always release memory
> ---
>
> Key: SPARK-6157
> URL: https://issues.apache.org/jira/browse/SPARK-6157
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.2.1
>Reporter: SuYan
>
> === EDIT by andrewor14 ===
> The existing description was somewhat confusing, so here's a more succinct 
> version of it.
> If unrolling a block with MEMORY_AND_DISK was unsuccessful, we will drop the 
> block to disk
> directly. After doing so, however, we don't need the underlying array that 
> held the partial
> values anymore, so we should release the pending unroll memory for other 
> tasks on the same
> executor. Otherwise, other tasks may unnecessarily drop their blocks to disk 
> due to the lack
> of unroll space, resulting in worse performance.
> === Original comment ===
> Current code:
> Suppose we want to cache a MEMORY_AND_DISK level block:
> 1. We try to put it in memory and unrolling is unsuccessful; the unroll memory 
> stays reserved because we got an iterator from the partially unrolled array.
> 2. Then we put the block on disk.
> 3. We get the value from get(blockId) and an iterator from that value, and the 
> unrolled array is no longer needed. So here we should release the reserved 
> unroll memory, instead of only releasing it when the task ends.
> Also, somebody has already opened a pull request so that, when getting a 
> MEMORY_AND_DISK level block and caching it in memory from disk, we use 
> file.length to check whether it fits in the memory store instead of just 
> allocating a file.length buffer, which may lead to OOM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5036) Better support sending partial messages in Pregel API

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5036.
--
Resolution: Won't Fix

> Better support sending partial messages in Pregel API
> -
>
> Key: SPARK-5036
> URL: https://issues.apache.org/jira/browse/SPARK-5036
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: shijinkui
> Attachments: s1.jpeg, s2.jpeg
>
>
> Better support sending partial messages in Pregel API
> 1. the requirement
> In many iterative graph algorithms, only some of the vertexes (we call them 
> ActiveVertexes) need to send messages to their neighbours in each iteration. 
> In many cases, ActiveVertexes are the vertexes whose attributes changed 
> between the previous and the current iteration. To implement this 
> requirement, we can use the Pregel API + a flag (e.g., `bool isAttrChanged`) in 
> each vertex's attribute. 
> However, after the `aggregateMessage` or `mapReduceTriplets` of each iteration, 
> we need to reset this flag to its initial value in every vertex, which needs a 
> heavy `joinVertices`. 
> We found a more efficient way to meet this requirement and want to discuss it 
> here.
> Look at a simple example as follows:
> In the i-th iteration, the previous attribute of each vertex is `Attr` and the 
> newly computed attribute is `NewAttr`:
> | VID | Attr | NewAttr | Neighbours |
> |:----|:-----|:--------|:-----------|
> | 1   | 4    | 5       | 2, 3       |
> | 2   | 3    | 2       | 1, 4       |
> | 3   | 2    | 2       | 1, 4       |
> | 4   | 3    | 4       | 1, 2, 3    |
> Our requirement is that: 
> 1. Set each vertex's `Attr` to be `NewAttr` in the i-th iteration
> 2. For each vertex whose `Attr != NewAttr`, send a message to its neighbours 
> in the next iteration's `aggregateMessage`.
> We found it hard to implement this requirement efficiently using the current 
> Pregel API. The reason is that we not only need to perform `pregel()` to 
> compute the `NewAttr` (2) but also need to perform `outJoin()` to satisfy 
> (1).
> A simple idea is to keep an `isAttrChanged: Boolean` (solution 1) or a 
> `flag: Int` (solution 2) in each vertex's attribute.
>  2. two solutions  
> ---
> 2.1 solution 1: label and reset `isAttrChanged:Boolean` in the vertex attr
> ![alt text](s1.jpeg "Title")
> 1. init messages by `aggregateMessage`
>   it returns a messageRDD
> 2. `innerJoin`
>   compute the messages on the receiving vertex, return a new VertexRDD 
> which has the value computed by the custom logic function `vprog`, and set 
> `isAttrChanged = true`
> 3. `outerJoinVertices`
>   update the changed vertexes in the whole graph. now the graph is new.
> 4. `aggregateMessage`. it returns a messageRDD
> 5. `joinVertices`  reset every `isAttrChanged` of the vertex attr to false
>   ```
>   //  here reset the isAttrChanged to false
>   g = updateG.joinVertices(updateG.vertices) {
>   (vid, oriVertex, updateGVertex) => updateGVertex.reset()
>   }
> ```
> here we need to reset the vertex attribute object's variable to false;
> if we don't reset `isAttrChanged`, the vertex will send a message in the next 
> iteration directly.
> **result:**  
> * Edge: 890041895 
> * Vertex: 181640208
> * Iterate: 150 times
> * Cost total: 8.4h
> * can't run down to 0 messages 
> solution 2. color vertex
> ![alt text](s2.jpeg "Title")
> iteration process:
> 1. innerJoin 
>   `vprog` is used as a partial function, and looks like `vprog(curIter, _: 
> VertexId, _: VD, _: A)`
>   ` i = i + 1; val curIter = i`. 
>   in `vprog`, the user can fetch `curIter` and assign it to `flag`.
> 2. outerJoinVertices
>   `graph = graph.outerJoinVertices(changedVerts) { (vid, old, newOpt) => 
> newOpt.getOrElse(old)}.cache()`
> 3. aggregateMessages 
>   sendMsg is a partial function, and looks like `sendMsg(curIter, _: 
> EdgeContext[VD, ED, A])`
>   **in `sendMsg`, compare `curIter` with `flag` to determine whether to 
> send a message**
>   result
> raw data   from
> * vertex: 181640208
> * edge: 890041895
> |            | iteration average cost | 150 iteration cost | 420 iteration cost |
> | ---------- | ---------------------- | ------------------ | ------------------ |
> | solution 1 | 188m                   | 7.8h               | cannot finish      |
> | solution 2 | 24                     | 1.2h               | 3.1h               |
> | compare    | 7x                     | 6.5x               | finished in 3.1h   |
> 
> ## the end
> 
> I think the second solution (Pregel + a flag) is better.
> It can really support the iterative graph algorithms in which only part of the 
> vertexes send messages to their neighbours in each iteration.
> We will use it in a production environment.
> pr: https://github.com/apache/spark/pull/3866
> EOF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4675.
--
Resolution: Won't Fix

> Find similar products and similar users in MatrixFactorizationModel
> ---
>
> Key: SPARK-4675
> URL: https://issues.apache.org/jira/browse/SPARK-4675
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Steven Bourke
>Priority: Trivial
>  Labels: mllib, recommender
>
> Using the latent feature space that is learnt in MatrixFactorizationModel, I 
> have added 2 new functions to find similar products and similar users. A user 
> of the API can for example pass a product ID, and get the closest products. 
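
For context, a brute-force sketch of the product-similarity half of the idea, 
using only the existing {{productFeatures}} RDD (illustrative, not the proposed 
API):

{code}
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Illustrative sketch only: cosine similarity of one product's latent factors
// against all others, returning the `num` closest product ids.
def similarProducts(model: MatrixFactorizationModel,
                    productId: Int, num: Int): Array[(Int, Double)] = {
  def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)
  val target = model.productFeatures.lookup(productId).head
  val targetNorm = norm(target)
  model.productFeatures
    .filter { case (id, _) => id != productId }
    .map { case (id, factors) =>
      val dot = factors.zip(target).map { case (a, b) => a * b }.sum
      (id, dot / (norm(factors) * targetNorm))
    }
    .top(num)(Ordering.by[(Int, Double), Double](_._2))
}
{code}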



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1987) More memory-efficient graph construction

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1987.
--
Resolution: Won't Fix

> More memory-efficient graph construction
> 
>
> Key: SPARK-1987
> URL: https://issues.apache.org/jira/browse/SPARK-1987
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> A graph's edges are usually the largest component of the graph. GraphX 
> currently stores edges in parallel primitive arrays, so each edge should only 
> take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the 
> current implementation in EdgePartitionBuilder uses an array of Edge objects 
> as an intermediate representation for sorting, so each edge additionally 
> takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr 
> (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This 
> unnecessarily increases GraphX's memory requirements by a factor of 3.
> To save memory, EdgePartitionBuilder should instead use a custom sort routine 
> that operates directly on the three parallel arrays.
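
A simplified illustration of the parallel-array approach (the real change would 
use a primitive sort to avoid even the boxed index permutation used here):

{code}
// Simplified illustration only: sort edges stored as three parallel arrays by
// (srcId, dstId) without building per-edge objects, by sorting an index
// permutation and then permuting each array once.
def sortParallelEdges(srcIds: Array[Long], dstIds: Array[Long], attrs: Array[Int])
    : (Array[Long], Array[Long], Array[Int]) = {
  val order = Array.range(0, srcIds.length).sortBy(i => (srcIds(i), dstIds(i)))
  (order.map(i => srcIds(i)), order.map(i => dstIds(i)), order.map(i => attrs(i)))
}
{code}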



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4976) trust region Newton optimizer in mllib

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4976.
--
Resolution: Won't Fix

> trust region Newton optimizer in mllib
> --
>
> Key: SPARK-4976
> URL: https://issues.apache.org/jira/browse/SPARK-4976
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ching-Pei Lee
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Trust region Newton method (TRON) is a truncated Newton method for optimizing 
> twice-differentiable, unconstrained convex optimization problems. It has been 
> shown in recent research 
> (http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf) to be faster than 
> L-BFGS when solving logistic regression problems.
> One of the use cases is training large-scale logistic regression / least squares 
> regression with many features and instances.
> We'll use the breeze implementation of TRON.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4526) Gradient should be added batch computing interface

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4526.
--
Resolution: Won't Fix

> Gradient should be added batch computing interface
> --
>
> Key: SPARK-4526
> URL: https://issues.apache.org/jira/browse/SPARK-4526
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> If Gradient supported batch computing, we could use efficient numerical 
> libraries (e.g., BLAS).
> In some cases, this can improve performance by more than a factor of ten.
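
A hypothetical sketch of what a batch-oriented interface could look like 
(illustrative names only, not an actual MLlib class):

{code}
import org.apache.spark.mllib.linalg.{DenseMatrix, DenseVector}

// Hypothetical interface sketch only: hand the gradient a whole block of
// instances so the implementation can call into BLAS (e.g. a single GEMM)
// instead of looping instance by instance.
trait BatchGradient {
  /** @param data        numInstances x numFeatures block of features
    * @param labels      one label per instance
    * @param weights     current model weights
    * @param cumGradient accumulator the summed gradient is added into
    * @return total loss over the block */
  def computeBatch(data: DenseMatrix, labels: Array[Double],
                   weights: DenseVector, cumGradient: DenseVector): Double
}
{code}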



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4902) gap-sampling performance optimization

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4902.
--
Resolution: Won't Fix

> gap-sampling performance optimization
> -
>
> Key: SPARK-4902
> URL: https://issues.apache.org/jira/browse/SPARK-4902
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> {{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator 
> that wraps an array or an iterator (when there is not enough memory). 
> The GapSamplingIterator implementation is as follows:
> {code}
> private val iterDrop: Int => Unit = {
> val arrayClass = Array.empty[T].iterator.getClass
> val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass
> data.getClass match {
>   case `arrayClass` => ((n: Int) => { data = data.drop(n) })
>   case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) })
>   case _ => ((n: Int) => {
>   var j = 0
>   while (j < n && data.hasNext) {
> data.next()
> j += 1
>   }
> })
> }
>   }
> {code}
> The code does not deal with InterruptibleIterator.
> Because of that, the following code can't use the {{Iterator.drop}} method:
> {code}
> rdd.cache()
> rdd.sample(false,0.1)
> {code}
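
A hedged sketch of one way to see through the wrapper (assuming access to 
InterruptibleIterator's underlying delegate; this is not the actual 
GapSamplingIterator patch):

{code}
import org.apache.spark.InterruptibleIterator

// Hedged sketch only: unwrap InterruptibleIterator before deciding whether the
// fast drop(n) path applies, so cached-array iterators are still recognized.
def underlying[T](it: Iterator[T]): Iterator[T] = it match {
  case wrapped: InterruptibleIterator[T] => underlying(wrapped.delegate)
  case other => other
}
// The arrayClass / arrayBufferClass checks above could then be run against
// underlying(data).getClass instead of data.getClass.
{code}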



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12590) Inconsistent behavior of randomSplit in YARN mode

2015-12-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12590.
---
Resolution: Not A Problem

Yes, I think you've hit it on the head: the issue is that you're recomputing a 
non-deterministic RDD, and so you aren't seeing consistent results. However, 
the non-determinism isn't actually the randomSplit call (it's seeded even), but 
the repartition with a shuffle. Generally, this behavior is as expected, and 
indeed, you have to cache rdd2 in order to not recompute it.
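
A minimal sketch of that workaround, using the reporter's own example:

{code}
// Materialize the shuffled RDD once so both splits are derived from the same,
// stable data rather than from a recomputed (and re-randomized) shuffle.
val rdd = sc.parallelize(1 to 100)
val rdd2 = rdd.repartition(64).cache()
rdd2.count()                                          // force the one-time evaluation
val Array(train, test) = rdd2.randomSplit(Array(0.7, 0.3), seed = 1L)
{code}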

> Inconsistent behavior of randomSplit in YARN mode
> -
>
> Key: SPARK-12590
> URL: https://issues.apache.org/jira/browse/SPARK-12590
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core
>Affects Versions: 1.5.2
> Environment: YARN mode
>Reporter: Gaurav Kumar
>
> I noticed an inconsistent behavior when using rdd.randomSplit when the source 
> rdd is repartitioned, but only in YARN mode. It works fine in local mode 
> though.
> *Code:*
> val rdd = sc.parallelize(1 to 100)
> val rdd2 = rdd.repartition(64)
> rdd.partitions.size
> rdd2.partitions.size
> val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
> train.takeOrdered(10)
> test.takeOrdered(10)
> *Master: local*
> Both the take statements produce consistent results and have no overlap in 
> numbers being outputted.
> *Master: YARN*
> However, when these are run in YARN mode, they produce different results every 
> time, and the train and test sets overlap in the numbers being output.
> If I use rdd.randomSplit, then it works fine even on YARN.
> So it seems that the repartition is being re-evaluated every time the 
> splitting occurs.
> Interestingly, if I cache the rdd2 before splitting it, then we can expect 
> consistent behavior since repartition is not evaluated again and again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12590) Inconsistent behavior of randomSplit in YARN mode

2015-12-31 Thread Gaurav Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075902#comment-15075902
 ] 

Gaurav Kumar commented on SPARK-12590:
--

Thanks [~srowen] for the explanation.
I think most users, unaware of such behavior, tend to do either of these 2 
kinds of things:
1. Cache the source RDD and then do a {{randomSplit}} and use the train and 
test going forward. This won't be an issue since the source RDD is cached.
2. Do a {{randomSplit}} and then cache train and test separately. This will 
create an issue with the splitting.
I think there should be a warning of some sort in randomSplit's 
documentation making users aware of such behavior. It took me quite a while to 
debug the overlap between the train and test sets.

> Inconsistent behavior of randomSplit in YARN mode
> -
>
> Key: SPARK-12590
> URL: https://issues.apache.org/jira/browse/SPARK-12590
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core
>Affects Versions: 1.5.2
> Environment: YARN mode
>Reporter: Gaurav Kumar
>
> I noticed an inconsistent behavior when using rdd.randomSplit when the source 
> rdd is repartitioned, but only in YARN mode. It works fine in local mode 
> though.
> *Code:*
> val rdd = sc.parallelize(1 to 100)
> val rdd2 = rdd.repartition(64)
> rdd.partitions.size
> rdd2.partitions.size
> val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
> train.takeOrdered(10)
> test.takeOrdered(10)
> *Master: local*
> Both the take statements produce consistent results and have no overlap in 
> numbers being outputted.
> *Master: YARN*
> However, when these are run in YARN mode, they produce different results every 
> time, and the train and test sets overlap in the numbers being output.
> If I use rdd.randomSplit, then it works fine even on YARN.
> So it seems that the repartition is being re-evaluated every time the 
> splitting occurs.
> Interestingly, if I cache the rdd2 before splitting it, then we can expect 
> consistent behavior since repartition is not evaluated again and again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12591) NullPointerException using checkpointed mapWithState with KryoSerializer

2015-12-31 Thread Jan Uyttenhove (JIRA)
Jan Uyttenhove created SPARK-12591:
--

 Summary: NullPointerException using checkpointed mapWithState with 
KryoSerializer
 Key: SPARK-12591
 URL: https://issues.apache.org/jira/browse/SPARK-12591
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.6.0
 Environment: MacOSX
Java(TM) SE Runtime Environment (build 1.8.0_20-ea-b17)
Reporter: Jan Uyttenhove


The issue occurred after upgrading to RC4 of Spark (streaming) 1.6.0 to (re)test 
the new mapWithState API, after previously reporting issue SPARK-11932 
(https://issues.apache.org/jira/browse/SPARK-11932). 

For the initial report, see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-1-6-0-RC4-NullPointerException-using-mapWithState-tt15830.html

I narrowed it down to an issue unrelated to the Kafka direct stream, but, after 
observing very unpredictable behavior as a result of changes to the Kafka 
message format, it seems to be related to Kryo serialization specifically.

For test case, see my modified version of the StatefulNetworkWordCount example: 
https://gist.github.com/juyttenh/9b4a4103699a7d5f698f 

To reproduce, use RC4 of Spark-1.6.0 and 
- nc -lk 
- execute the supplied test case: 
bin/spark-submit --class 
org.apache.spark.examples.streaming.StatefulNetworkWordCount --master local[2] 
file:///some-assembly-jar localhost 

Error scenario:
- put some text in the nc console with the job running, and observe correct 
functioning of the word count
- kill the spark job
- add some more text in the nc console (with the job not running)
- restart the spark job and observe the NPE
(you might need to repeat this a couple of times to trigger the exception)

Here's the stacktrace;
15/12/31 11:43:47 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 5)
java.lang.NullPointerException
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/31 11:43:47 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, 
localhost, partition 1,NODE_LOCAL, 2239 bytes)
15/12/31 11:43:47 INFO Executor: Running task 1.0 in stage 4.0 (TID 6)
15/12/31 11:43:47 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 5, 
localhost): java.lang.NullPointerException
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at

[jira] [Commented] (SPARK-8555) Online Variational Inference for the Hierarchical Dirichlet Process

2015-12-31 Thread Tu Dinh Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075909#comment-15075909
 ] 

Tu Dinh Nguyen commented on SPARK-8555:
---

Hi,

I'm Tu from Deakin University. Our team is currently working intensively to 
develop this model on Spark.
We would also like to integrate it into MLlib. Could anyone please tell me 
what the workflow looks like?

Your help is much appreciated!


> Online Variational Inference for the Hierarchical Dirichlet Process
> ---
>
> Key: SPARK-8555
> URL: https://issues.apache.org/jira/browse/SPARK-8555
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Priority: Minor
>
> The task is created for exploration on the online HDP algorithm described in
> http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf.
> Major advantage for the algorithm: one pass on corpus, streaming friendly, 
> automatic K (topic number).
> Currently the scope is to support online HDP for topic modeling, i.e. 
> probably an optimizer for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12592) TestHive.reset hides Spark testing logs

2015-12-31 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12592:
--

 Summary: TestHive.reset hides Spark testing logs
 Key: SPARK-12592
 URL: https://issues.apache.org/jira/browse/SPARK-12592
 Project: Spark
  Issue Type: Test
  Components: Tests
Reporter: Cheng Lian


There's a hack in {{TestHive.reset()}} which is intended to mute noisy Hive 
loggers. However, it also mutes the Spark testing loggers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12592) TestHive.reset hides Spark testing logs

2015-12-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075910#comment-15075910
 ] 

Apache Spark commented on SPARK-12592:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/10540

> TestHive.reset hides Spark testing logs
> ---
>
> Key: SPARK-12592
> URL: https://issues.apache.org/jira/browse/SPARK-12592
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Cheng Lian
>
> There's a hack in {{TestHive.reset()}} which is intended to mute noisy 
> Hive loggers. However, it also mutes the Spark testing loggers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12591) NullPointerException using checkpointed mapWithState with KryoSerializer

2015-12-31 Thread Jan Uyttenhove (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Uyttenhove updated SPARK-12591:
---
Description: 
The issue occurred after upgrading to RC4 of Spark (streaming) 1.6.0 to (re)test 
the new mapWithState API, after previously reporting issue SPARK-11932 
(https://issues.apache.org/jira/browse/SPARK-11932). 

For the initial report, see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-1-6-0-RC4-NullPointerException-using-mapWithState-tt15830.html

I narrowed it down to an issue unrelated to the Kafka direct stream, but, after 
observing very unpredictable behavior as a result of changes to the Kafka 
message format, it seems to be related to Kryo serialization specifically.

For test case, see my modified version of the StatefulNetworkWordCount example: 
https://gist.github.com/juyttenh/9b4a4103699a7d5f698f 

To reproduce, use RC4 of Spark-1.6.0 and 
- nc -lk 
- execute the supplied test case: 
bin/spark-submit --class 
org.apache.spark.examples.streaming.StatefulNetworkWordCount --master local[2] 
file:///some-assembly-jar localhost 

Error scenario:
- put some text in the nc console with the job running, and observe correct 
functioning of the word count
- kill the spark job
- add some more text in the nc console (with the job not running)
- restart the spark job and observe the NPE
(you might need to repeat this a couple of times to trigger the exception)

Here's the stacktrace: 
{code}
15/12/31 11:43:47 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 5)
java.lang.NullPointerException
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/31 11:43:47 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, 
localhost, partition 1,NODE_LOCAL, 2239 bytes)
15/12/31 11:43:47 INFO Executor: Running task 1.0 in stage 4.0 (TID 6)
15/12/31 11:43:47 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 5, 
localhost): java.lang.NullPointerException
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

[jira] [Updated] (SPARK-12591) NullPointerException using checkpointed mapWithState with KryoSerializer

2015-12-31 Thread Jan Uyttenhove (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Uyttenhove updated SPARK-12591:
---
Description: 
The issue occurred after upgrading to RC4 of Spark (streaming) 1.6.0 to (re)test 
the new mapWithState API, after previously reporting issue SPARK-11932 
(https://issues.apache.org/jira/browse/SPARK-11932). 

For the initial report, see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-1-6-0-RC4-NullPointerException-using-mapWithState-tt15830.html

I narrowed it down to an issue unrelated to the Kafka direct stream, but, after 
observing very unpredictable behavior as a result of changes to the Kafka 
message format, it seems to be related to Kryo serialization specifically.

For test case, see my modified version of the StatefulNetworkWordCount example: 
https://gist.github.com/juyttenh/9b4a4103699a7d5f698f 

To reproduce, use RC4 of Spark-1.6.0 and 
- {code}nc -lk {code}
- execute the supplied test case: 
{code}bin/spark-submit --class 
org.apache.spark.examples.streaming.StatefulNetworkWordCount --master local[2] 
file:///some-assembly-jar localhost {code}

Error scenario:
- put some text in the nc console with the job running, and observe correct 
functioning of the word count
- kill the spark job
- add some more text in the nc console (with the job not running)
- restart the spark job and observe the NPE
(you might need to repeat this a couple of times to trigger the exception)

Here's the stacktrace: 
{code}
15/12/31 11:43:47 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 5)
java.lang.NullPointerException
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at 
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/31 11:43:47 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, 
localhost, partition 1,NODE_LOCAL, 2239 bytes)
15/12/31 11:43:47 INFO Executor: Running task 1.0 in stage 4.0 (TID 6)
15/12/31 11:43:47 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 5, 
localhost): java.lang.NullPointerException
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at 
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIte

[jira] [Updated] (SPARK-12591) NullPointerException using checkpointed mapWithState with KryoSerializer

2015-12-31 Thread Jan Uyttenhove (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Uyttenhove updated SPARK-12591:
---
Description: 
The issue occurred after upgrading to RC4 of Spark (Streaming) 1.6.0 to (re)test 
the new mapWithState API, after previously reporting SPARK-11932 
(https://issues.apache.org/jira/browse/SPARK-11932). 

For the initial report, see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-1-6-0-RC4-NullPointerException-using-mapWithState-tt15830.html

I narrowed it down to an issue unrelated to the Kafka direct stream, but, after 
observing very unpredictable behavior as a result of changes to the Kafka 
message format, it seems to be related specifically to Kryo serialization.

For a test case, see my modified version of the StatefulNetworkWordCount example: 
https://gist.github.com/juyttenh/9b4a4103699a7d5f698f 

To reproduce, use RC4 of Spark 1.6.0 and:
- start nc:
{code}nc -lk {code}
- execute the supplied test case: 
{code}bin/spark-submit --class 
org.apache.spark.examples.streaming.StatefulNetworkWordCount --master local[2] 
file:///some-assembly-jar localhost {code}

Error scenario:
- put some text in the nc console while the job is running, and observe that the 
word count works correctly
- kill the Spark job
- add some more text in the nc console (with the job not running)
- restart the Spark job and observe the NPE
(you might need to repeat this a couple of times to trigger the exception)
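
For reference, the test case in the gist has roughly the following shape (a minimal 
sketch only; the checkpoint directory, host, and port below are illustrative 
placeholders, not the exact values from the gist):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StatefulNetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    val checkpointDir = "/tmp/checkpoint"   // illustrative path

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("StatefulNetworkWordCount")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      val ssc = new StreamingContext(conf, Seconds(1))
      ssc.checkpoint(checkpointDir)

      // Running count per word, kept in mapWithState's state store
      val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
        val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
        state.update(sum)
        (word, sum)
      }

      val words = ssc.socketTextStream("localhost", 9999)   // illustrative port
        .flatMap(_.split(" "))
        .map(word => (word, 1))
      words.mapWithState(StateSpec.function(mappingFunc)).print()
      ssc
    }

    // On restart the context (and its state) is recovered from the checkpoint;
    // this recovery path is where the NPE is observed.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}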

Here's the stacktrace: 
{code}
15/12/31 11:43:47 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 5)
java.lang.NullPointerException
at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/31 11:43:47 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, localhost, partition 1,NODE_LOCAL, 2239 bytes)
15/12/31 11:43:47 INFO Executor: Running task 1.0 in stage 4.0 (TID 6)
15/12/31 11:43:47 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 5, localhost): java.lang.NullPointerException
at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

[jira] [Assigned] (SPARK-12592) TestHive.reset hides Spark testing logs

2015-12-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12592:


Assignee: Apache Spark

> TestHive.reset hides Spark testing logs
> ---
>
> Key: SPARK-12592
> URL: https://issues.apache.org/jira/browse/SPARK-12592
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> There's a hack in {{TestHive.reset()}} intended to mute noisy Hive loggers. 
> However, Spark testing loggers are also muted.
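
A minimal sketch of the kind of package-scoped muting that would avoid this, 
assuming log4j 1.x and illustrative logger names (this is not the actual 
TestHive code):
{code}
import org.apache.log4j.{Level, Logger}

// Raise the level only on the noisy Hive logger hierarchies instead of the
// root logger, so loggers under org.apache.spark keep their configured levels.
def muteHiveLoggers(): Unit = {
  Seq("org.apache.hadoop.hive", "org.apache.hive", "hive")
    .foreach(name => Logger.getLogger(name).setLevel(Level.WARN))
}
{code}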



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12592) TestHive.reset hides Spark testing logs

2015-12-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12592:


Assignee: (was: Apache Spark)

> TestHive.reset hides Spark testing logs
> ---
>
> Key: SPARK-12592
> URL: https://issues.apache.org/jira/browse/SPARK-12592
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Cheng Lian
>
> There's a hack in {{TestHive.reset()}} intended to mute noisy Hive loggers. 
> However, Spark testing loggers are also muted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12593) Convert resolved logical plans back to SQL query strings

2015-12-31 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12593:
--

 Summary: Convert resolved logical plans back to SQL query strings
 Key: SPARK-12593
 URL: https://issues.apache.org/jira/browse/SPARK-12593
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12592) TestHive.reset hides Spark testing logs

2015-12-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075930#comment-15075930
 ] 

Apache Spark commented on SPARK-12592:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/10541

> TestHive.reset hides Spark testing logs
> ---
>
> Key: SPARK-12592
> URL: https://issues.apache.org/jira/browse/SPARK-12592
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Cheng Lian
>
> There's a hack in {{TestHive.reset()}} intended to mute noisy Hive loggers. 
> However, Spark testing loggers are also muted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12594) Join Conversion: Outer to Inner/Left/Right, Right to Inner and Left to Inner

2015-12-31 Thread Xiao Li (JIRA)
Xiao Li created SPARK-12594:
---

 Summary: Join Conversion: Outer to Inner/Left/Right, Right to 
Inner and Left to Inner
 Key: SPARK-12594
 URL: https://issues.apache.org/jira/browse/SPARK-12594
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Xiao Li
Priority: Critical


Convert outer joins when local predicates restrict the result set so that all 
null-supplying rows are eliminated: 

- full outer -> inner if both sides have such local predicates
- left outer -> inner if the right side has such local predicates
- right outer -> inner if the left side has such local predicates
- full outer -> left outer if only the left side has such local predicates
- full outer -> right outer if only the right side has such local predicates

Where applicable, this can greatly improve performance. 
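
For example, with a null-rejecting predicate on the null-supplying side, a left 
outer join returns exactly the same rows as an inner join. A hedged Spark sketch 
(table and column names are illustrative, and a {{sqlContext}} is assumed to be 
in scope):
{code}
// Assuming two registered tables t1(a, b) and t2(a, b); names are illustrative.
val left  = sqlContext.table("t1")
val right = sqlContext.table("t2")

// The predicate right("b") > 10 is never true on the null-supplied rows that a
// left outer join adds for unmatched keys, so those rows are filtered out anyway.
val outerThenFilter = left.join(right, left("a") === right("a"), "left_outer")
  .filter(right("b") > 10)

// Hence the plan can safely use an inner join instead, which is cheaper.
val innerThenFilter = left.join(right, left("a") === right("a"), "inner")
  .filter(right("b") > 10)

// outerThenFilter and innerThenFilter produce the same result.
{code}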



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12594) Join Conversion: Outer to Inner/Left/Right, Right to Inner and Left to Inner

2015-12-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076200#comment-15076200
 ] 

Apache Spark commented on SPARK-12594:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10542

> Join Conversion: Outer to Inner/Left/Right, Right to Inner and Left to Inner
> 
>
> Key: SPARK-12594
> URL: https://issues.apache.org/jira/browse/SPARK-12594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Priority: Critical
>
> Convert outer joins when local predicates restrict the result set so that all 
> null-supplying rows are eliminated:
> - full outer -> inner if both sides have such local predicates
> - left outer -> inner if the right side has such local predicates
> - right outer -> inner if the left side has such local predicates
> - full outer -> left outer if only the left side has such local predicates
> - full outer -> right outer if only the right side has such local predicates
> Where applicable, this can greatly improve performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12594) Join Conversion: Outer to Inner/Left/Right, Right to Inner and Left to Inner

2015-12-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12594:


Assignee: (was: Apache Spark)

> Join Conversion: Outer to Inner/Left/Right, Right to Inner and Left to Inner
> 
>
> Key: SPARK-12594
> URL: https://issues.apache.org/jira/browse/SPARK-12594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Priority: Critical
>
> Convert outer joins when local predicates restrict the result set so that all 
> null-supplying rows are eliminated:
> - full outer -> inner if both sides have such local predicates
> - left outer -> inner if the right side has such local predicates
> - right outer -> inner if the left side has such local predicates
> - full outer -> left outer if only the left side has such local predicates
> - full outer -> right outer if only the right side has such local predicates
> Where applicable, this can greatly improve performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12594) Join Conversion: Outer to Inner/Left/Right, Right to Inner and Left to Inner

2015-12-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12594:


Assignee: Apache Spark

> Join Conversion: Outer to Inner/Left/Right, Right to Inner and Left to Inner
> 
>
> Key: SPARK-12594
> URL: https://issues.apache.org/jira/browse/SPARK-12594
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>
> Convert outer joins when local predicates restrict the result set so that all 
> null-supplying rows are eliminated:
> - full outer -> inner if both sides have such local predicates
> - left outer -> inner if the right side has such local predicates
> - right outer -> inner if the left side has such local predicates
> - full outer -> left outer if only the left side has such local predicates
> - full outer -> right outer if only the right side has such local predicates
> Where applicable, this can greatly improve performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12196) Store/retrieve blocks in different speed storage devices by hierarchy way

2015-12-31 Thread yucai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-12196:
--
Description: 
*Motivation*
Nowadays, customers have both SSDs (SATA SSD/PCIe SSD) and HDDs. 
SSDs have great performance, but their capacity is small. 
HDDs have large capacity, but they are much slower than SSDs (2-3x slower than 
SATA SSD, 20x slower than PCIe SSD).
How can we get the best of both?

*Proposal*
One solution is to build a hierarchical store: use SSDs as a cache and HDDs as 
backing storage. 
When Spark core allocates blocks (either for shuffle or RDD cache), it allocates 
them on SSDs first; when the SSDs' usable space drops below some threshold, it 
allocates blocks on HDDs instead.

In our implementation, we actually go further: we support building a hierarchy 
store with any number of levels across various storage media (MEM, NVM, SSD, 
HDD, etc.).
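
To make the selection rule concrete, here is a minimal, illustrative sketch of 
the fall-through logic (made-up names and sizes; this is not the actual 
implementation):
{code}
// Illustrative tier-selection sketch; not the actual Spark block manager code.
case class StorageLayer(name: String, usableBytes: Long, thresholdBytes: Long)

// Pick the first (highest-priority) layer that still has room above its
// threshold after placing the block; otherwise fall through to the last layer.
def chooseLayer(layers: Seq[StorageLayer], blockSize: Long): StorageLayer =
  layers.find(l => l.usableBytes - blockSize > l.thresholdBytes)
    .getOrElse(layers.last)

// Example: NVM first, then SSD, with HDD as the unconditional fallback.
val layers = Seq(
  StorageLayer("nvm", usableBytes = 50L << 30, thresholdBytes = 40L << 30),
  StorageLayer("ssd", usableBytes = 25L << 30, thresholdBytes = 20L << 30),
  StorageLayer("hdd", usableBytes = 900L << 30, thresholdBytes = 0L))
val target = chooseLayer(layers, blockSize = 128L << 20)   // -> the "nvm" layer
{code}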

*Performance*
1. In the best case, our solution performs the same as all SSDs.
2. In the worst case, e.g. when all data spills to HDDs, there is no performance 
regression.
3. Compared with all HDDs, the hierarchy store improves performance by more than 
*_1.86x_* (it could be higher; CPU becomes the bottleneck in our test environment).
4. Compared with Tachyon, our hierarchy store is still *_1.3x_* faster, because we 
support both RDD cache and shuffle, with no extra inter-process communication.

*Test Environment*
1. 4 IVB boxes (40 cores, 192GB memory, 10Gb NIC, 11 HDDs/11 SATA SSDs/PCIe SSD) 
2. Real customer case: NWeight (graph analysis), which computes associations 
between two vertices that are n hops away (e.g., friend-to-friend or 
video-to-video relationships for recommendation). 
3. Data size: 22GB; vertices: 41 million; edges: 1.4 billion.

*Usage*
1. Set the priority and threshold for each layer in 
spark.storage.hierarchyStore.
{code}
spark.storage.hierarchyStore='nvm 40GB,ssd 20GB'
{code}
This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is "ssd", 
and all the remaining storage forms the last layer.

2. Configure each layer's location: the user just needs to put the keywords 
specified in step 1, like "nvm" and "ssd", into the local dirs, e.g. 
spark.local.dir or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}

Then restart your Spark application; it will allocate blocks from nvm first.
When nvm's usable space is less than 40GB, it starts allocating from ssd.
When ssd's usable space is less than 20GB, it starts allocating from the last 
layer.

> Store/retrieve blocks in different speed storage devices by hierarchy way

[jira] [Commented] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2015-12-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076234#comment-15076234
 ] 

Apache Spark commented on SPARK-10359:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10543

> Enumerate Spark's dependencies in a file and diff against it for new pull 
> requests 
> ---
>
> Key: SPARK-10359
> URL: https://issues.apache.org/jira/browse/SPARK-10359
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Sometimes when we have dependency changes it can be pretty unclear what 
> transitive set of things is changing. If we enumerate all of the dependencies 
> and put them in a source file in the repo, we can make it very explicit what 
> is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org