Re: Minimum cost flow problem solving in Spark

2017-09-13 Thread Michael Malak
You might be interested in "Maximum Flow implementation on Spark GraphX" done by a Colorado School of Mines grad student a couple of years ago. http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx From: Swapnil Shinde To: user@spark.ap

Re: Shortest path with directed and weighted graphs

2016-10-24 Thread Michael Malak
Chapter 6 of my book implements Dijkstra's Algorithm. The source code is available to download for free.  https://www.manning.com/books/spark-graphx-in-action From: Brian Wilson To: user@spark.apache.org Sent: Monday, October 24, 2016 7:11 AM Subject: Shortest path with directed and

Re: GraphX drawing algorithm

2016-09-11 Thread Michael Malak
In chapter 10 of Spark GraphX In Action, we describe how to use Zeppelin with d3.js to render graphs using d3's force-directed rendering algorithm. The source code can be downloaded for free from  https://www.manning.com/books/spark-graphx-in-action From: agc studio To: user@spark.apache.

Re: Where is DataFrame.scala in 2.0?

2016-06-03 Thread Michael Malak
It's been reduced to a single line of code. http://technicaltidbit.blogspot.com/2016/03/dataframedataset-swap-places-in-spark-20.html From: Gerhard Fiedler To: "dev@spark.apache.org" Sent: Friday, June 3, 2016 9:01 AM Subject: Where is DataFrame.scala in 2.0? When I look at the

Re: GraphX Java API

2016-05-30 Thread Michael Malak
Yes, it is possible to use GraphX from Java but it requires 10x the amount of code and involves using obscure typing and pre-defined lambda prototype facilities. I give an example of it in my book, the source code for which can be downloaded for free from  https://www.manning.com/books/spark-gra

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Michael Malak
At first glance, it looks like the only streaming data sources available out of the box from the github master branch are  https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala  and  https://github.com/apache/spark/blob/

Re: Spark 2.0 forthcoming features

2016-04-20 Thread Michael Malak
http://go.databricks.com/apache-spark-2.0-presented-by-databricks-co-founder-reynold-xin From: Sourav Mazumder To: user Sent: Wednesday, April 20, 2016 11:07 AM Subject: Spark 2.0 forthcoming features Hi All, Is there somewhere we can get idea of the upcoming features in Spark 2

Re: Apache Flink

2016-04-17 Thread Michael Malak
As with all history, "what if"s are not scientifically testable hypotheses, but my speculation is the energy (VCs, startups, big Internet companies, universities) within Silicon Valley contrasted to Germany. From: Mich Talebzadeh To: Michael Malak ; "user @spark"

Re: Apache Flink

2016-04-17 Thread Michael Malak
There have been commercial CEP solutions for decades, including from my employer. From: Mich Talebzadeh To: Mark Hamstra Cc: Corey Nolet ; "user @spark" Sent: Sunday, April 17, 2016 3:48 PM Subject: Re: Apache Flink The problem is that the strength and wider acceptance of a typic

Re: Apache Flink

2016-04-17 Thread Michael Malak
In terms of publication date, a paper on Nephele was published in 2009, prior to the 2010 USENIX paper on Spark. Nephele is the execution engine of Stratosphere, which became Flink. From: Mark Hamstra To: Mich Talebzadeh Cc: Corey Nolet ; "user @spark" Sent: Sunday, April 17, 2016 3:

Re: [discuss] using deep learning to improve Spark

2016-04-01 Thread Michael Malak
I see you've been burning the midnight oil. From: Reynold Xin To: "dev@spark.apache.org" Sent: Friday, April 1, 2016 1:15 AM Subject: [discuss] using deep learning to improve Spark Hi all, Hope you all enjoyed the Tesla 3 unveiling earlier tonight. I'd like to bring your attention

Re: Spark with Druid

2016-03-23 Thread Michael Malak
Will Spark 2.0 Structured Streaming obviate some of the Druid/Spark use cases? From: Raymond Honderdors To: "yuzhih...@gmail.com" Cc: "user@spark.apache.org" Sent: Wednesday, March 23, 2016 8:43 AM Subject: Re: Spark with Druid I saw these but i fail to understand how to direct th

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Michael Malak
Would it make sense (in terms of feasibility, code organization, and politics) to have a JavaDataFrame, as a way to isolate the 1000+ extra lines to a Java compatibility layer/class? From: Reynold Xin To: "dev@spark.apache.org" Sent: Thursday, February 25, 2016 4:23 PM Subject: [d

[jira] [Commented] (SPARK-3789) [GRAPHX] Python bindings for GraphX

2015-11-04 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990391#comment-14990391 ] Michael Malak commented on SPARK-3789: -- My publisher tells me the MEAP for S

[jira] [Updated] (SPARK-11278) PageRank fails with unified memory manager

2015-10-23 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak updated SPARK-11278: -- Component/s: GraphX > PageRank fails with unified memory mana

[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2015-10-09 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950679#comment-14950679 ] Michael Malak commented on SPARK-2365: -- It's off-topic of IndexedRDD, bu

[jira] [Commented] (SPARK-10939) Misaligned data with RDD.zip after repartition

2015-10-08 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948596#comment-14948596 ] Michael Malak commented on SPARK-10939: --- Here Matei explains the explicit de

[jira] [Updated] (SPARK-10972) UDFs in SQL joins

2015-10-07 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak updated SPARK-10972: -- Description: Currently expressions used to .join() in DataFrames are limited to column names

[jira] [Created] (SPARK-10972) UDFs in SQL joins

2015-10-07 Thread Michael Malak (JIRA)
Michael Malak created SPARK-10972: - Summary: UDFs in SQL joins Key: SPARK-10972 URL: https://issues.apache.org/jira/browse/SPARK-10972 Project: Spark Issue Type: New Feature

[jira] [Commented] (SPARK-10722) Uncaught exception: RDDBlockId not found in driver-heartbeater

2015-09-27 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909883#comment-14909883 ] Michael Malak commented on SPARK-10722: --- I have seen this in a small Hello W

[jira] [Commented] (SPARK-10489) GraphX dataframe wrapper

2015-09-10 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739681#comment-14739681 ] Michael Malak commented on SPARK-10489: --- Feynman Liang: Link https://github

Re: Build k-NN graph for large dataset

2015-08-26 Thread Michael Malak
Yes. And a paper that describes using grids (actually varying grids) is  http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf  In the Spark GraphX In Action book that Robin East and I are writing, we implement a drastically simplified version of this in chapter

RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Michael Malak
I would also add, from a data locality theoretic standpoint, mapPartitions() provides for node-local computation that plain old map-reduce does not. From my Android phone on T-Mobile. The first nationwide 4G network. Original message From: Ashic Mahtab Date: 06/28/2015 10:5
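A minimal sketch of the mapPartitions() point above, assuming lines is an RDD[String] of date strings (illustrative only, not from the thread):

    import java.text.SimpleDateFormat

    // mapPartitions runs its function once per partition, on the node holding that
    // partition, so per-partition setup happens node-locally instead of per record.
    val parsed = lines.mapPartitions { iter =>
      val fmt = new SimpleDateFormat("yyyy-MM-dd")   // built once per partition
      iter.map(s => fmt.parse(s).getTime)
    }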

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Michael Malak
http://www.datascienceassn.org/content/making-sense-making-sense-performance-data-analytics-frameworks From: "bit1...@163.com" To: user Sent: Monday, April 27, 2015 8:33 PM Subject: Why Spark is much faster than Hadoop MapReduce even on disk

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Michael Malak
You could have your receiver send a "magic value" when it is done. I discuss this Spark Streaming pattern in my presentation "Spark Gotchas and Anti-Patterns". In the PDF version, it's slides 34-36. http://www.datascienceassn.org/content/2014-11-05-spark-gotchas-and-anti-patterns-julia-language

[jira] [Created] (SPARK-6710) Wrong initial bias in GraphX SVDPlusPlus

2015-04-04 Thread Michael Malak (JIRA)
Michael Malak created SPARK-6710: Summary: Wrong initial bias in GraphX SVDPlusPlus Key: SPARK-6710 URL: https://issues.apache.org/jira/browse/SPARK-6710 Project: Spark Issue Type: Bug

Wrong initial bias in GraphX SVDPlusPlus?

2015-04-03 Thread Michael Malak
I believe that in the initialization portion of GraphX SVDPlusPlus, the initialization of biases is incorrect. Specifically, in line https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96 instead of (vd._1, vd._2, msg.get._2 / msg.ge

Spark GraphX In Action on documentation page?

2015-03-24 Thread Michael Malak
Can my new book, Spark GraphX In Action, which is currently in MEAP http://manning.com/malak/, be added to https://spark.apache.org/documentation.html and, if appropriate, to https://spark.apache.org/graphx/ ? Michael Malak

[jira] [Commented] (SPARK-6388) Spark 1.3 + Hadoop 2.6 Can't work on Java 8_40

2015-03-17 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365758#comment-14365758 ] Michael Malak commented on SPARK-6388: -- Isn't it Hadoop 2.7 that is su

textFile() ordering and header rows

2015-02-22 Thread Michael Malak
Since RDDs are generally unordered, aren't things like textFile().first() not guaranteed to return the first row (such as looking for a header row)? If so, doesn't that make the example in http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?
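A sketch of a header-dropping pattern that avoids relying on overall RDD ordering (assumes only that the header sits in the first input split; not from the linked docs):

    // Drop the first line of partition 0; other partitions pass through unchanged.
    val noHeader = sc.textFile("data.csv").mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter
    }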

[jira] [Commented] (SPARK-4279) Implementing TinkerPop on top of GraphX

2015-02-06 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309459#comment-14309459 ] Michael Malak commented on SPARK-4279: -- Is there another place where I might be

Word2Vec IndexedRDD

2015-02-01 Thread Michael Malak
1. Is IndexedRDD planned for 1.3? https://issues.apache.org/jira/browse/SPARK-2365 2. Once IndexedRDD is in, is it planned to convert Word2VecModel to it from its current Map[String,Array[Float]]? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Wo

Re: spark challenge: zip with next???

2015-01-30 Thread Michael Malak
But isn't foldLeft() overkill for the originally stated use case of max diff of adjacent pairs? Isn't foldLeft() for recursive non-commutative non-associative accumulation as opposed to an embarrassingly parallel operation such as this one? This use case reminds me of FIR filtering in DSP. It se
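A sketch of one embarrassingly parallel formulation of the adjacent-pair ("zip with next") use case, assuming values is an RDD[Double] whose logical order is given by position (illustrative, not the thread's solution):

    // Tag each element with its position, join it against a copy keyed by the
    // previous position, and take the maximum absolute adjacent difference.
    val indexed = values.zipWithIndex().map { case (v, i) => (i, v) }
    val shifted = indexed.map { case (i, v) => (i - 1, v) }
    val maxDiff = indexed.join(shifted)            // (i, (x_i, x_{i+1}))
      .values
      .map { case (a, b) => math.abs(b - a) }
      .max()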

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Michael Malak
ose not immersed in data science or AI and thus may have narrower appeal. - Original Message ----- From: Evan R. Sparks To: Matei Zaharia Cc: Koert Kuipers ; Michael Malak ; Patrick Wendell ; Reynold Xin ; "dev@spark.apache.org" Sent: Tuesday, January 27, 2015 9:55 AM Subject: Re: renaming

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Michael Malak
And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup YouTube contained a wealth of background information on this idea (mostly from Patrick and Reynold :-). https://www.youtube.com/watch?v=YWppYPWznSQ From: Patrick Wendell To:

Re: GraphX ShortestPaths backwards?

2015-01-20 Thread Michael Malak
I created https://issues.apache.org/jira/browse/SPARK-5343 for this. - Original Message - From: Michael Malak To: "dev@spark.apache.org" Cc: Sent: Monday, January 19, 2015 5:09 PM Subject: GraphX ShortestPaths backwards? GraphX ShortestPaths seems to be following edges

[jira] [Created] (SPARK-5343) ShortestPaths traverses backwards

2015-01-20 Thread Michael Malak (JIRA)
Michael Malak created SPARK-5343: Summary: ShortestPaths traverses backwards Key: SPARK-5343 URL: https://issues.apache.org/jira/browse/SPARK-5343 Project: Spark Issue Type: Bug

GraphX ShortestPaths backwards?

2015-01-19 Thread Michael Malak
GraphX ShortestPaths seems to be following edges backwards instead of forwards: import org.apache.spark.graphx._ val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,"" lib.ShortestPaths.run(g,Array(3)).vertices.collect res1: Array[(org.apac
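A reconstruction of the truncated repro (sketch; empty-string attributes as in the original):

    import org.apache.spark.graphx._

    // 3-vertex chain 1 -> 2 -> 3
    val g = Graph(
      sc.makeRDD(Array((1L, ""), (2L, ""), (3L, ""))),
      sc.makeRDD(Array(Edge(1L, 2L, ""), Edge(2L, 3L, ""))))

    // With landmark 3, forward traversal would give vertices 1 and 2 distances 2 and 1;
    // the report is that the implementation propagates along reversed edges instead.
    lib.ShortestPaths.run(g, Seq(3L)).vertices.collect()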

Re: GraphX vertex partition/location strategy

2015-01-19 Thread Michael Malak
But wouldn't the gain be greater under something similar to EdgePartition1D (but perhaps better load-balanced based on number of edges for each vertex) and an algorithm that primarily follows edges in the forward direction? From: Ankur Dave To: Michael Malak Cc: "dev@spark.

GraphX vertex partition/location strategy

2015-01-19 Thread Michael Malak
Does GraphX make an effort to co-locate vertices onto the same workers as the majority (or even some) of its edges?

GraphX doc: triangleCount() requirement overstatement?

2015-01-18 Thread Michael Malak
According to: https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting "Note that TriangleCount requires the edges to be in canonical orientation (srcId < dstId)" But isn't this overstating the requirement? Isn't the requirement really that IF there are duplicate ed
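A sketch of putting an existing graph into canonical orientation before counting triangles (assumes no self-loops; duplicate edges would still need separate handling):

    import org.apache.spark.graphx._

    // Flip any edge with srcId > dstId so every edge satisfies srcId < dstId,
    // then partition and count triangles.
    val canonicalEdges = graph.edges.map { e =>
      if (e.srcId < e.dstId) e else Edge(e.dstId, e.srcId, e.attr)
    }
    val triangles = Graph(graph.vertices, canonicalEdges)
      .partitionBy(PartitionStrategy.RandomVertexCut)
      .triangleCount()
      .vertices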

Re: RDD Moving Average

2015-01-06 Thread Michael Malak
Asim Jalis writes: > > ​Thanks. Another question. ​I have event data with timestamps. I want to > create a sliding window > using timestamps. Some windows will have a lot of events in them others > won’t. Is there a way > to get an RDD made of this kind of a variable length window? You should c

Re: GraphX rmatGraph hangs

2015-01-04 Thread Michael Malak
Thank you. I created https://issues.apache.org/jira/browse/SPARK-5064 - Original Message - From: xhudik To: dev@spark.apache.org Cc: Sent: Saturday, January 3, 2015 2:04 PM Subject: Re: GraphX rmatGraph hangs Hi Michael, yes, I can confirm the behavior. It get stuck (loop?) and eat a

[jira] [Created] (SPARK-5064) GraphX rmatGraph hangs

2015-01-03 Thread Michael Malak (JIRA)
Michael Malak created SPARK-5064: Summary: GraphX rmatGraph hangs Key: SPARK-5064 URL: https://issues.apache.org/jira/browse/SPARK-5064 Project: Spark Issue Type: Bug Components

GraphX rmatGraph hangs

2015-01-03 Thread Michael Malak
The following single line just hangs, when executed in either Spark Shell or standalone: org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 4, 8) It just outputs "0 edges" and then locks up. The only other information I've found via Google is: http://mail-archives.apache.org/mod_mbox/sp

Re: Rdd of Rdds

2014-10-22 Thread Michael Malak
On Wednesday, October 22, 2014 9:06 AM, Sean Owen wrote: > No, there's no such thing as an RDD of RDDs in Spark. > Here though, why not just operate on an RDD of Lists? or a List of RDDs? > Usually one of these two is the right approach whenever you feel > inclined to operate on an RDD of RDDs.

Re: UpdateStateByKey - How to improve performance?

2014-08-06 Thread Michael Malak
Depending on the density of your keys, the alternative signature def updateStateByKey[S](updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)], partitioner: Partitioner, rememberPartitioner: Boolean)(implicit arg0: ClassTag[S]): DStream[(K, S)] at least iterates by key rather than
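A sketch of that Iterator-based overload in use, assuming a DStream[(String, Int)] named counts and the Spark 1.x streaming API:

    import org.apache.spark.HashPartitioner

    // Running total per key; the update function receives one Iterator per partition
    // instead of being invoked separately for every key.
    val totals = counts.updateStateByKey[Long](
      (iter: Iterator[(String, Seq[Int], Option[Long])]) =>
        iter.map { case (key, newValues, state) =>
          (key, state.getOrElse(0L) + newValues.sum)
        },
      new HashPartitioner(4),
      rememberPartitioner = true)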

Re: relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread Michael Malak
It's really more of a Scala question than a Spark question, but the standard OO (not Scala-specific) way is to create your own custom supertype (e.g. MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD and MyArray), each of which manually forwards method calls to the co
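A minimal sketch of the forwarding approach described above (names are illustrative; only two methods shown):

    import org.apache.spark.rdd.RDD

    trait MyCollectionTrait {
      def mapRows(f: Array[String] => Array[String]): MyCollectionTrait
      def count(): Long
    }

    class MyRDD(rdd: RDD[Array[String]]) extends MyCollectionTrait {
      def mapRows(f: Array[String] => Array[String]) = new MyRDD(rdd.map(f))
      def count(): Long = rdd.count()
    }

    class MyArray(data: Array[Array[String]]) extends MyCollectionTrait {
      def mapRows(f: Array[String] => Array[String]) = new MyArray(data.map(f))
      def count(): Long = data.length.toLong
    }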

15 new MLlib algorithms

2014-07-09 Thread Michael Malak
At Spark Summit, Patrick Wendell indicated the number of MLlib algorithms would "roughly double" in 1.1 from the current approx. 15. http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf What are the planned additional algorithms? In Jira, I only see two when fil

Re: parallel Reduce within a key

2014-06-20 Thread Michael Malak
How about a treeReduceByKey? :-) On Friday, June 20, 2014 11:55 AM, DB Tsai wrote: Currently, the reduce operation combines the result from mapper sequentially, so it's O(n). Xiangrui is working on treeReduce which is O(log(n)). Based on the benchmark, it dramatically increase the performan

GraphX triplets on 5-node graph

2014-05-28 Thread Michael Malak
Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I missing something fundamental? val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L, "N4"), (5L, "N5"))) val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"), Edge(2L, 4L, "E

[jira] [Resolved] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems

2014-05-28 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak resolved SPARK-1836. -- Resolution: Duplicate > REPL $outer type mismatch causes lookup() and equals() probl

[jira] [Commented] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell

2014-05-28 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011492#comment-14011492 ] Michael Malak commented on SPARK-1199: -- See also additional test cases in h

Re: rdd ordering gets scrambled

2014-05-28 Thread Michael Malak
Mohit Jaggi: A workaround is to use zipWithIndex (to appear in Spark 1.0, but if you're still on 0.9x you can swipe the code from  https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala  ), map it to (x => (x._2,x._1)) and then sortByKey. Sp
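A sketch of the workaround as described (tag each element with its position before the order-scrambling step, then sort on that key afterwards):

    // zipWithIndex pairs each element with its original position.
    val tagged = rdd.zipWithIndex().map { case (x, i) => (i, x) }
    // ... order-scrambling transformations keyed by the original index ...
    val restored = tagged.sortByKey().values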

[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-23 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007565#comment-14007565 ] Michael Malak commented on SPARK-1867: -- Thank you, sam, that fixed it for me!

[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-23 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007238#comment-14007238 ] Michael Malak commented on SPARK-1867: -- I, too, have run into this issue, and I

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Michael Malak
While developers may appreciate "1.0 == API stability," I'm not sure that will be the understanding of the VP who gives the green light to a Spark-based development effort. I fear a bug that silently produces erroneous results will be perceived like the FDIV bug, but in this case without the mo

[jira] [Commented] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems

2014-05-16 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998807#comment-13998807 ] Michael Malak commented on SPARK-1836: -- Michael Armbrust: Indeed. Do you thi

[jira] [Created] (SPARK-1857) map() with lookup() causes exception

2014-05-16 Thread Michael Malak (JIRA)
Michael Malak created SPARK-1857: Summary: map() with lookup() causes exception Key: SPARK-1857 URL: https://issues.apache.org/jira/browse/SPARK-1857 Project: Spark Issue Type: Bug

[jira] [Updated] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems

2014-05-16 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak updated SPARK-1836: - Description: Anand Avati partially traced the cause to REPL wrapping classes in $outer classes

map() + lookup() exception

2014-05-15 Thread Michael Malak
When using map() and lookup() in conjunction, I get an exception (each independently works fine). I'm using Spark 0.9.0/Scala 2.10.3 val a = sc.parallelize(Array(11)) val m = sc.parallelize(Array((11,21))) a.map(m.lookup(_)(0)).collect 14/05/14 15:03:35 ERROR Executor: Exception in task ID 23 sc
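A sketch of the failing pattern plus one common workaround (collecting the small pair RDD to the driver first); the workaround is illustrative, not necessarily the thread's conclusion:

    val a = sc.parallelize(Array(11))
    val m = sc.parallelize(Array((11, 21)))

    // Fails: lookup() is an RDD action, and here it is invoked inside a.map(),
    // i.e. from executor-side code.
    // a.map(x => m.lookup(x)(0)).collect()

    // Workaround sketch: materialize the lookup table locally, then map over it.
    val table = m.collectAsMap()
    a.map(x => table(x)).collect()   // Array(21)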

Serializable different behavior Spark Shell vs. Scala Shell

2014-05-14 Thread Michael Malak
I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I substitute with isInstanceOf[]. I am using Spark 0.9.0/Scala 2.10.3. Is this a bug? Spark Shell (equals uses match

Class-based key in groupByKey?

2014-05-13 Thread Michael Malak
Is it permissible to use a custom class (as opposed to e.g. the built-in String or Int) for the key in groupByKey? It doesn't seem to be working for me on Spark 0.9.0/Scala 2.10.3: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ class C(val s:String) extends Serializ
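Independent of any REPL-specific issue discussed in the thread, a sketch of the basic requirement: a groupByKey key class needs value-based equals() and hashCode(), which a case class provides:

    // A case class gives structural equals/hashCode, so identical keys group together.
    case class C(s: String)

    val r = sc.parallelize(Seq((C("a"), 1), (C("a"), 2), (C("b"), 3)))
    r.groupByKey().collect()   // expected: C(a) -> [1, 2], C(b) -> [3], up to ordering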

Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
12))) r: org.apache.spark.rdd.RDD[(C, Int)] = ParallelCollectionRDD[3] at parallelize at :14 scala> r.lookup(new C("a")) :17: error: type mismatch;  found   : C  required: C   r.lookup(new C("a"))    ^ On Tuesday, May 13, 2014 3:05 PM, Ana

Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Reposting here on dev since I didn't see a response on user: I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I substitute with isInstanceOf[]. I am using Spark 0.9.0

[jira] [Created] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-13 Thread Michael Malak (JIRA)
Michael Malak created SPARK-1817: Summary: RDD zip erroneous when partitions do not divide RDD count Key: SPARK-1817 URL: https://issues.apache.org/jira/browse/SPARK-1817 Project: Spark

Re: Bug when zip with longs and too many partitions?

2014-05-12 Thread Michael Malak
s the ASF Jira system will let me reset my password. On Sunday, May 11, 2014 4:40 AM, Michael Malak wrote: Is this a bug? scala> sc.parallelize(1 to 2,4).zip(sc.parallelize(11 to 12,4)).collect res0: Array[(Int, Int)] = Array((1,11), (2,12)) scala> sc.parallelize(1L to 2L,4).zip(sc.par

Bug when zip with longs and too many partitions?

2014-05-12 Thread Michael Malak
Is this a bug? scala> sc.parallelize(1 to 2,4).zip(sc.parallelize(11 to 12,4)).collect res0: Array[(Int, Int)] = Array((1,11), (2,12)) scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect res1: Array[(Long, Int)] = Array((2,11))

Re: Opinions stratosphere

2014-05-02 Thread Michael Malak
"looks like Spark outperforms Stratosphere fairly consistently in the experiments" There was one exception the paper noted, which was when memory resources were constrained. In that case, Stratosphere seemed to have degraded more gracefully than Spark, but the author did not explore it deeper.

Re: Kafka not shutting down cleanly; Actor serialization?

2013-12-09 Thread Michael Malak
UTION. From: Michael Malak To: "dev@spark.incubator.apache.org" Sent: Thursday, September 26, 2013 12:27 PM Subject: Kafka not shutting down cleanly; Actor serialization? Tathagata: I don't believe Kafka streams are being shut down cleanly, which im

Re: Spark Streaming architecture question - shared memory model

2013-09-30 Thread Michael Malak
Domingo Mihovilovic writes: > Imagine that you are processing a stream data at high speed and needs to >build, update, > and access some memory data structure where the "model" is stored.  Normally this is done with updateStateByKey, which maintains an RDD behind the sce

Kafka not shutting down cleanly; Actor serialization?

2013-09-26 Thread Michael Malak
Tathagata: I don't believe Kafka streams are being shut down cleanly, which implies that the most recent Kafka offsets are not being committed back to Zookeeper, which implies starting/restarting a Spark Streaming process would result in duplicate events. The simple Spark Streaming code (runn

Re: UDFs with package names

2013-07-31 Thread Michael Malak
Yup, it was the directory structure com/mystuff/whateverUDF.class that was missing.  Thought I had tried that before posting my question, but... Thanks for your help! From: Edward Capriolo To: "user@hive.apache.org" ; Michael Malak Sent: Tuesda

UDFs with package names

2013-07-30 Thread Michael Malak
Thus far, I've been able to create Hive UDFs, but now I need to define them within a Java package name (as opposed to the "default" Java package as I had been doing), but once I do that, I'm no longer able to load them into Hive. First off, this works: add jar /usr/lib/hive/lib/hive-contrib-0.1

Re: Best Performance on Large Scale Join

2013-07-29 Thread Michael Malak
Perhaps you can first create a temp table that contains only the records that will match?  See the UNION ALL trick at http://www.mail-archive.com/hive-user@hadoop.apache.org/msg01906.html From: Brad Ruderman To: user@hive.apache.org Sent: Monday, July 29, 201

Re: Oracle to Hive

2013-07-10 Thread Michael Malak
Untested: SELECT a.c100, a.c300, b.c400   FROM t1 a   JOIN t2 b   ON a.c200 = b.c200   JOIN (SELECT DISTINCT a.c100           FROM t1 a2           JOIN t2 b2           ON a2.c200 = b2.c200         WHERE b2.c400 >= SYSDATE - 1) a3   ON a.c100 = a3.c100   WHERE b.c400 >= SYSDATE - 1    AND a.c300 =

Re: How Can I store the Hive query result in one file ?

2013-07-04 Thread Michael Malak
I have found that for output larger than a few GB, redirecting stdout results in an incomplete file.  For very large output, I do CREATE TABLE MYTABLE AS SELECT ... and then copy the resulting HDFS files directly out of  /user/hive/warehouse. From: Bertrand De

Re: Fwd: Need urgent help in hive query

2013-06-28 Thread Michael Malak
Just copy and paste the whole long expressions to their second occurrences. From: dyuti a To: user@hive.apache.org Sent: Friday, June 28, 2013 10:58 AM Subject: Fwd: Need urgent help in hive query Hi Experts, I'm trying with the below SQL query in Hive, whi

Re: how to combine some rows into 1 row in hive

2013-06-23 Thread Michael Malak
ang wrote: Thanks Michael! That worked without modification! > > > >On Sat, Jun 22, 2013 at 5:05 PM, Michael Malak wrote: > >Or, the single-language (HiveQL) alternative might be (i.e. I haven't tested >it): >>  >>select f1, >>   f2, >> 

Re: how to combine some rows into 1 row in hive

2013-06-22 Thread Michael Malak
Or, the single-language (HiveQL) alternative might be (i.e. I haven't tested it):   select f1,    f2,    if(max(if(f3='P',f4,null)) is null,0,max(if(f3='P',f4,null))) pf4,   if(max(if(f3='P',f5,null)) is null,0,max(if(f3='P',f5,null))) pf5,    if(max(if(f3='N',f4,null)) is null,0,

Re: INSERT non-static data to array?

2013-06-20 Thread Michael Malak
thing. From: Edward Capriolo To: "user@hive.apache.org" ; Michael Malak Sent: Thursday, June 20, 2013 9:15 PM Subject: Re: INSERT non-static data to array? i think you could select into as sub query and then use lateral view.not exactly the same but somethin

Re: INSERT non-static data to array?

2013-06-20 Thread Michael Malak
I've created https://issues.apache.org/jira/browse/HIVE-4771 to track this issue. - Original Message - From: Michael Malak To: "user@hive.apache.org" Cc: Sent: Wednesday, June 19, 2013 2:35 PM Subject: Re: INSERT non-static data to array? The example code for inlin

[jira] [Created] (HIVE-4771) Support subqueries in INSERT for array types

2013-06-20 Thread Michael Malak (JIRA)
Michael Malak created HIVE-4771: --- Summary: Support subqueries in INSERT for array types Key: HIVE-4771 URL: https://issues.apache.org/jira/browse/HIVE-4771 Project: Hive Issue Type

Re: INSERT non-static data to array?

2013-06-19 Thread Michael Malak
c int[]); INSERT INTO table_a   SELECT a, b, ARRAY(SELECT c FROM table_c WHERE table_c.parent = table_b.id)   FROM table_b From: Edward Capriolo To: "user@hive.apache.org" ; Michael Malak Sent: Wednesday, June 19, 2013 2:06 PM Subject: Re: INSERT non

INSERT non-static data to array?

2013-06-19 Thread Michael Malak
Is the only way to INSERT data into a column of type array<> to load data from a pre-existing file, to use hard-coded values in the INSERT statement, or copy an entire array verbatim from another table?  I.e. I'm assuming that a) SQL1999 array INSERT via subquery is not (yet) implemented in Hive

Re: Hive Group By Limitations

2013-05-06 Thread Michael Malak
--- On Mon, 5/6/13, Peter Chu wrote: > In Hive, I cannot perform a SELECT GROUP BY on fields not in the GROUP BY > clause. Although MySQL allows it, it is not ANSI SQL. http://stackoverflow.com/questions/1225144/why-does-mysql-allow-group-by-queries-without-aggregate-functions

Re: Hive QL - NOT IN, NOT EXIST

2013-05-05 Thread Michael Malak
--- On Sun, 5/5/13, Peter Chu wrote: > I am wondering if there is any way to do this without resorting to > using left outer join and finding nulls. I have found this to be an acceptable substitute. Is it not working for you?

[jira] [Commented] (HIVE-3528) Avro SerDe doesn't handle serializing Nullable types that require access to a Schema

2013-02-20 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582664#comment-13582664 ] Michael Malak commented on HIVE-3528: - As noted in the first comment from h

[jira] [Commented] (HIVE-4022) Structs and struct fields cannot be NULL in INSERT statements

2013-02-20 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582662#comment-13582662 ] Michael Malak commented on HIVE-4022: - Note that there is a workaround for the cas

[jira] [Updated] (HIVE-4022) Structs and struct fields cannot be NULL in INSERT statements

2013-02-19 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak updated HIVE-4022: Description: Originally thought to be Avro-specific, and first noted with respect to HIVE-3528

Re: NULLable STRUCTs

2013-02-19 Thread Michael Malak
If no one has any objection, I'm going to update HIVE-4022, which I entered a week ago when I thought the behavior was Avro-specific, to indicate it actually affects even native Hive tables. https://issues.apache.org/jira/browse/HIVE-4022 --- On Fri, 2/15/13, Michael Malak wrote: &

NULLable STRUCTs

2013-02-15 Thread Michael Malak
It seems that all Hive columns (at least those of primitive types) are always NULLable? What about columns of type STRUCT? The following: echo 1,2 >twovalues.csv hive CREATE TABLE tc (x INT, y INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA LOCAL INPATH 'twovalues.csv' INTO TABLE

[jira] [Created] (HIVE-4022) Avro SerDe queries don't handle hard-coded nulls for optional/nullable structs

2013-02-14 Thread Michael Malak (JIRA)
Michael Malak created HIVE-4022: --- Summary: Avro SerDe queries don't handle hard-coded nulls for optional/nullable structs Key: HIVE-4022 URL: https://issues.apache.org/jira/browse/HIVE-4022 Pr

[jira] [Commented] (HIVE-3528) Avro SerDe doesn't handle serializing Nullable types that require access to a Schema

2013-02-14 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578783#comment-13578783 ] Michael Malak commented on HIVE-3528: - Sean: OK, I've researched the proble

[jira] [Commented] (HIVE-3528) Avro SerDe doesn't handle serializing Nullable types that require access to a Schema

2013-02-14 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578538#comment-13578538 ] Michael Malak commented on HIVE-3528: - Sean: I mean https://github.com/apache/

[jira] [Commented] (HIVE-3528) Avro SerDe doesn't handle serializing Nullable types that require access to a Schema

2013-02-14 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578523#comment-13578523 ] Michael Malak commented on HIVE-3528: - I've tried the latest Avro SerDe fr

Re: INSERT INTO table with STRUCT, SELECT FROM

2013-02-13 Thread Michael Malak
o I would write to a different directory and then move the files over... dean On Wed, Feb 13, 2013 at 1:26 PM, Michael Malak wrote: Is it possible to INSERT INTO TABLE t SELECT FROM where t has a column with a STRUCT? Based on http://grokbase.com/t/hive/user/109r87hh3e/insert-data-into-a-co

[jira] [Commented] (AVRO-1035) Add the possibility to append to existing avro files

2013-02-07 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/AVRO-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573711#comment-13573711 ] Michael Malak commented on AVRO-1035: - ha...@cloudera.com has provided example cod

Re: Is it possible to append to an already existing avro file

2013-02-07 Thread Michael Malak
output streams to let Avro take it as > append-able? > I don't think its possible for Avro to carry it since Avro > (core) does > not reverse-depend on Hadoop. Should we document it > somewhere though? > Do you have any ideas on the best place to do that? > > On Thu,
