Re: Minimum cost flow problem solving in Spark

2017-09-13 Thread Michael Malak
You might be interested in "Maximum Flow implementation on Spark GraphX" done by a Colorado School of Mines grad student a couple of years ago. http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx From: Swapnil Shinde

Re: Shortest path with directed and weighted graphs

2016-10-24 Thread Michael Malak
Chapter 6 of my book implements Dijkstra's Algorithm. The source code is available to download for free.  https://www.manning.com/books/spark-graphx-in-action From: Brian Wilson To: user@spark.apache.org Sent: Monday, October 24, 2016 7:11 AM Subject:

Re: GraphX drawing algorithm

2016-09-11 Thread Michael Malak
In chapter 10 of Spark GraphX In Action, we describe how to use Zeppelin with d3.js to render graphs using d3's force-directed rendering algorithm. The source code can be downloaded for free from  https://www.manning.com/books/spark-graphx-in-action From: agc studio

Re: Where is DataFrame.scala in 2.0?

2016-06-03 Thread Michael Malak
It's been reduced to a single line of code. http://technicaltidbit.blogspot.com/2016/03/dataframedataset-swap-places-in-spark-20.html From: Gerhard Fiedler To: "dev@spark.apache.org" Sent: Friday, June 3, 2016 9:01 AM Subject: Where

Re: GraphX Java API

2016-05-30 Thread Michael Malak
Yes, it is possible to use GraphX from Java but it requires 10x the amount of code and involves using obscure typing and pre-defined lambda prototype facilities. I give an example of it in my book, the source code for which can be downloaded for free from 

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Michael Malak
At first glance, it looks like the only streaming data sources available out of the box from the github master branch are  https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala  and 

Re: Spark 2.0 forthcoming features

2016-04-20 Thread Michael Malak
http://go.databricks.com/apache-spark-2.0-presented-by-databricks-co-founder-reynold-xin From: Sourav Mazumder To: user Sent: Wednesday, April 20, 2016 11:07 AM Subject: Spark 2.0 forthcoming features Hi All, Is there

Re: Apache Flink

2016-04-17 Thread Michael Malak
As with all history, "what if"s are not scientifically testable hypotheses, but my speculation is the energy (VCs, startups, big Internet companies, universities) within Silicon Valley contrasted to Germany. From: Mich Talebzadeh <mich.talebza...@gmail.com> To: Michael

Re: Apache Flink

2016-04-17 Thread Michael Malak
There have been commercial CEP solutions for decades, including from my employer. From: Mich Talebzadeh To: Mark Hamstra Cc: Corey Nolet ; "user @spark" Sent: Sunday, April 17, 2016 3:48 PM

Re: Apache Flink

2016-04-17 Thread Michael Malak
In terms of publication date, a paper on Nephele was published in 2009, prior to the 2010 USENIX paper on Spark. Nephele is the execution engine of Stratosphere, which became Flink. From: Mark Hamstra To: Mich Talebzadeh Cc: Corey

Re: [discuss] using deep learning to improve Spark

2016-04-01 Thread Michael Malak
I see you've been burning the midnight oil. From: Reynold Xin To: "dev@spark.apache.org" Sent: Friday, April 1, 2016 1:15 AM Subject: [discuss] using deep learning to improve Spark Hi all, Hope you all enjoyed the Tesla 3 unveiling

Re: Spark with Druid

2016-03-23 Thread Michael Malak
Will Spark 2.0 Structured Streaming obviate some of the Druid/Spark use cases? From: Raymond Honderdors To: "yuzhih...@gmail.com" Cc: "user@spark.apache.org" Sent: Wednesday, March 23, 2016 8:43 AM Subject:

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Michael Malak
Would it make sense (in terms of feasibility, code organization, and politically) to have a JavaDataFrame, as a way to isolate the 1000+ extra lines to a Java compatibility layer/class? From: Reynold Xin To: "dev@spark.apache.org" Sent:

[jira] [Commented] (SPARK-3789) [GRAPHX] Python bindings for GraphX

2015-11-04 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990391#comment-14990391 ] Michael Malak commented on SPARK-3789: -- My publisher tells me the MEAP for Spark GraphX In Action has

[jira] [Updated] (SPARK-11278) PageRank fails with unified memory manager

2015-10-23 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak updated SPARK-11278: -- Component/s: GraphX > PageRank fails with unified memory mana

[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2015-10-09 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950679#comment-14950679 ] Michael Malak commented on SPARK-2365: -- It's off-topic of IndexedRDD, but you can have a look

[jira] [Commented] (SPARK-10939) Misaligned data with RDD.zip after repartition

2015-10-08 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948596#comment-14948596 ] Michael Malak commented on SPARK-10939: --- Here Matei explains the explicit design decision to prefer

[jira] [Created] (SPARK-10972) UDFs in SQL joins

2015-10-07 Thread Michael Malak (JIRA)
Michael Malak created SPARK-10972: - Summary: UDFs in SQL joins Key: SPARK-10972 URL: https://issues.apache.org/jira/browse/SPARK-10972 Project: Spark Issue Type: New Feature

[jira] [Updated] (SPARK-10972) UDFs in SQL joins

2015-10-07 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak updated SPARK-10972: -- Description: Currently expressions used to .join() in DataFrames are limited to column names

[jira] [Commented] (SPARK-10722) Uncaught exception: RDDBlockId not found in driver-heartbeater

2015-09-27 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909883#comment-14909883 ] Michael Malak commented on SPARK-10722: --- I have seen this in a small Hello World type program

[jira] [Commented] (SPARK-10489) GraphX dataframe wrapper

2015-09-10 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739681#comment-14739681 ] Michael Malak commented on SPARK-10489: --- Feynman Liang: Link https://github.com/databricks/spark

Re: Build k-NN graph for large dataset

2015-08-26 Thread Michael Malak
Yes. And a paper that describes using grids (actually varying grids) is  http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf  In the Spark GraphX In Action book that Robin East and I are writing, we implement a drastically simplified version of this in chapter

RE: What does Spark is not just MapReduce mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Michael Malak
I would also add, from a data locality theoretic standpoint, mapPartitions() provides for node-local computation that plain old map-reduce does not. Original message From: Ashic Mahtab as...@live.com Date:

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Michael Malak
http://www.datascienceassn.org/content/making-sense-making-sense-performance-data-analytics-frameworks   From: bit1...@163.com bit1...@163.com To: user user@spark.apache.org Sent: Monday, April 27, 2015 8:33 PM Subject: Why Spark is much faster than Hadoop MapReduce even on disk

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Michael Malak
You could have your receiver send a magic value when it is done. I discuss this Spark Streaming pattern in my presentation Spark Gotchas and Anti-Patterns. In the PDF version, it's slides 34-36. http://www.datascienceassn.org/content/2014-11-05-spark-gotchas-and-anti-patterns-julia-language
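The "magic value" idea can be sketched outside Spark with a plain producer/consumer queue. This is only an illustration of the pattern, not the code from the slides: the names DONE, receiver, consume and run_demo are hypothetical, and a real Spark Streaming receiver would push the sentinel through its own store mechanism instead of a Python queue.

```python
import queue
import threading

# Hypothetical sentinel ("magic value") that the receiver emits exactly once
# when it has no more data. A unique object cannot collide with real records.
DONE = object()


def receiver(outbox, records):
    """Push every record downstream, then signal completion."""
    for record in records:
        outbox.put(record)
    outbox.put(DONE)


def consume(inbox, on_complete):
    """Collect records until the sentinel arrives, then fire on_complete once."""
    seen = []
    while True:
        item = inbox.get()
        if item is DONE:
            on_complete(seen)  # runs once, only after the receiver finished
            return seen
        seen.append(item)


def run_demo(records, on_complete):
    """Wire a receiver thread to a consumer and return what was consumed."""
    inbox = queue.Queue()
    worker = threading.Thread(target=receiver, args=(inbox, records))
    worker.start()
    result = consume(inbox, on_complete)
    worker.join()
    return result
```

Calling run_demo(["a", "b", "c"], print) would print the collected list a single time, after the last record, which is the "only once upon receiver completion" behavior asked about in the thread.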

[jira] [Created] (SPARK-6710) Wrong initial bias in GraphX SVDPlusPlus

2015-04-04 Thread Michael Malak (JIRA)
Michael Malak created SPARK-6710: Summary: Wrong initial bias in GraphX SVDPlusPlus Key: SPARK-6710 URL: https://issues.apache.org/jira/browse/SPARK-6710 Project: Spark Issue Type: Bug

Wrong initial bias in GraphX SVDPlusPlus?

2015-04-03 Thread Michael Malak
I believe that in the initialization portion of GraphX SVDPlusPlus, the initialization of biases is incorrect. Specifically, in line https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96 instead of (vd._1, vd._2, msg.get._2 /

Spark GraphX In Action on documentation page?

2015-03-24 Thread Michael Malak
Can my new book, Spark GraphX In Action, which is currently in MEAP http://manning.com/malak/, be added to https://spark.apache.org/documentation.html and, if appropriate, to https://spark.apache.org/graphx/ ? Michael Malak

[jira] [Commented] (SPARK-6388) Spark 1.3 + Hadoop 2.6 Can't work on Java 8_40

2015-03-17 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365758#comment-14365758 ] Michael Malak commented on SPARK-6388: -- Isn't it Hadoop 2.7 that is supposed

textFile() ordering and header rows

2015-02-22 Thread Michael Malak
Since RDDs are generally unordered, aren't things like textFile().first() not guaranteed to return the first row (such as looking for a header row)? If so, doesn't that make the example in http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?

[jira] [Commented] (SPARK-4279) Implementing TinkerPop on top of GraphX

2015-02-06 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309459#comment-14309459 ] Michael Malak commented on SPARK-4279: -- Is there another place where I might be able

Word2Vec IndexedRDD

2015-02-01 Thread Michael Malak
1. Is IndexedRDD planned for 1.3? https://issues.apache.org/jira/browse/SPARK-2365 2. Once IndexedRDD is in, is it planned to convert Word2VecModel to it from its current Map[String,Array[Float]]?

Re: spark challenge: zip with next???

2015-01-30 Thread Michael Malak
But isn't foldLeft() overkill for the originally stated use case of max diff of adjacent pairs? Isn't foldLeft() for recursive non-commutative non-associative accumulation as opposed to an embarrassingly parallel operation such as this one? This use case reminds me of FIR filtering in DSP. It
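To illustrate why the maximum difference of adjacent pairs parallelizes easily, here is a plain-Python sketch (not Spark code). It assumes "diff" means the signed difference next minus current and that the input has at least two elements; the function names are hypothetical. Each chunk overlaps the next by one element, so no adjacent pair is lost at a chunk boundary, and the per-chunk maxima can be computed independently and then combined with max.

```python
def max_adjacent_diff(xs):
    """Largest (next - current) over all adjacent pairs of xs."""
    return max(b - a for a, b in zip(xs, xs[1:]))


def max_adjacent_diff_chunked(xs, chunk_size):
    """Same result, computed per chunk as independent partitions could.

    Each chunk carries one extra trailing element so that the pair
    straddling a chunk boundary is still examined exactly once.
    """
    partials = []
    for start in range(0, len(xs) - 1, chunk_size):
        chunk = xs[start:start + chunk_size + 1]
        partials.append(max_adjacent_diff(chunk))
    # max is associative and commutative, so partials combine in any order.
    return max(partials)
```

The one-element overlap plays the same role as the overlap a sliding window needs at partition boundaries.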

Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Michael Malak
Message - From: Evan R. Sparks evan.spa...@gmail.com To: Matei Zaharia matei.zaha...@gmail.com Cc: Koert Kuipers ko...@tresata.com; Michael Malak michaelma...@yahoo.com; Patrick Wendell pwend...@gmail.com; Reynold Xin r...@databricks.com; dev@spark.apache.org dev@spark.apache.org Sent: Tuesday

[jira] [Created] (SPARK-5343) ShortestPaths traverses backwards

2015-01-20 Thread Michael Malak (JIRA)
Michael Malak created SPARK-5343: Summary: ShortestPaths traverses backwards Key: SPARK-5343 URL: https://issues.apache.org/jira/browse/SPARK-5343 Project: Spark Issue Type: Bug

Re: GraphX ShortestPaths backwards?

2015-01-20 Thread Michael Malak
I created https://issues.apache.org/jira/browse/SPARK-5343 for this. - Original Message - From: Michael Malak michaelma...@yahoo.com To: dev@spark.apache.org dev@spark.apache.org Cc: Sent: Monday, January 19, 2015 5:09 PM Subject: GraphX ShortestPaths backwards? GraphX ShortestPaths

GraphX ShortestPaths backwards?

2015-01-19 Thread Michael Malak
GraphX ShortestPaths seems to be following edges backwards instead of forwards: import org.apache.spark.graphx._ val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,"")))) lib.ShortestPaths.run(g,Array(3)).vertices.collect res1:

Re: GraphX vertex partition/location strategy

2015-01-19 Thread Michael Malak
But wouldn't the gain be greater under something similar to EdgePartition1D (but perhaps better load-balanced based on number of edges for each vertex) and an algorithm that primarily follows edges in the forward direction? From: Ankur Dave ankurd...@gmail.com To: Michael Malak michaelma

GraphX vertex partition/location strategy

2015-01-19 Thread Michael Malak
Does GraphX make an effort to co-locate vertices onto the same workers as the majority (or even some) of its edges?

GraphX doc: triangleCount() requirement overstatement?

2015-01-18 Thread Michael Malak
According to: https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting Note that TriangleCount requires the edges to be in canonical orientation (srcId < dstId) But isn't this overstating the requirement? Isn't the requirement really that IF there are duplicate

Re: GraphX rmatGraph hangs

2015-01-04 Thread Michael Malak
Thank you. I created https://issues.apache.org/jira/browse/SPARK-5064 - Original Message - From: xhudik xhu...@gmail.com To: dev@spark.apache.org Cc: Sent: Saturday, January 3, 2015 2:04 PM Subject: Re: GraphX rmatGraph hangs Hi Michael, yes, I can confirm the behavior. It get stuck

GraphX rmatGraph hangs

2015-01-03 Thread Michael Malak
The following single line just hangs, when executed in either Spark Shell or standalone: org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 4, 8) It just outputs 0 edges and then locks up. The only other information I've found via Google is:

[jira] [Created] (SPARK-5064) GraphX rmatGraph hangs

2015-01-03 Thread Michael Malak (JIRA)
Michael Malak created SPARK-5064: Summary: GraphX rmatGraph hangs Key: SPARK-5064 URL: https://issues.apache.org/jira/browse/SPARK-5064 Project: Spark Issue Type: Bug Components

Re: Rdd of Rdds

2014-10-22 Thread Michael Malak
On Wednesday, October 22, 2014 9:06 AM, Sean Owen so...@cloudera.com wrote: No, there's no such thing as an RDD of RDDs in Spark. Here though, why not just operate on an RDD of Lists? or a List of RDDs? Usually one of these two is the right approach whenever you feel inclined to operate on an

Re: UpdateStateByKey - How to improve performance?

2014-08-06 Thread Michael Malak
Depending on the density of your keys, the alternative signature def updateStateByKey[S](updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)], partitioner: Partitioner, rememberPartitioner: Boolean)(implicit arg0: ClassTag[S]): DStream[(K, S)] at least iterates by key rather than

Re: relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread Michael Malak
It's really more of a Scala question than a Spark question, but the standard OO (not Scala-specific) way is to create your own custom supertype (e.g. MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD and MyArray), each of which manually forwards method calls to the

15 new MLlib algorithms

2014-07-09 Thread Michael Malak
At Spark Summit, Patrick Wendell indicated the number of MLlib algorithms would roughly double in 1.1 from the current approx. 15. http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf What are the planned additional algorithms? In Jira, I only see two when

Re: parallel Reduce within a key

2014-06-20 Thread Michael Malak
How about a treeReduceByKey? :-) On Friday, June 20, 2014 11:55 AM, DB Tsai dbt...@stanford.edu wrote: Currently, the reduce operation combines the result from mapper sequentially, so it's O(n). Xiangrui is working on treeReduce which is O(log(n)). Based on the benchmark, it dramatically
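The difference between a sequential reduce and a tree-shaped one can be sketched in plain Python. This is a model of the idea only, not Spark's treeReduce implementation: tree_reduce and tree_reduce_by_key are hypothetical names, the operator is assumed to be associative, and the logarithmic depth only pays off when the combinations within one round actually run in parallel.

```python
from collections import defaultdict


def tree_reduce(values, op):
    """Combine values pairwise in rounds; return (result, number_of_rounds).

    With n values the loop runs about log2(n) rounds, versus the n - 1
    strictly sequential steps of a left-to-right reduce.
    """
    level = list(values)
    if not level:
        raise ValueError("tree_reduce needs at least one value")
    rounds = 0
    while len(level) > 1:
        # Combine neighbours (0,1), (2,3), ...; these are independent of
        # each other and could be evaluated concurrently.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        rounds += 1
    return level[0], rounds


def tree_reduce_by_key(pairs, op):
    """Hypothetical per-key variant: group by key, then tree-reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: tree_reduce(vals, op)[0] for key, vals in groups.items()}
```

For eight values the sketch needs three rounds, which is the O(log(n)) depth mentioned in the quoted message.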

GraphX triplets on 5-node graph

2014-05-29 Thread Michael Malak
Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I missing something fundamental? val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L, "N4"), (5L, "N5"))) val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"), Edge(2L, 4L, "E3"), Edge(3L,

[jira] [Commented] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell

2014-05-28 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011492#comment-14011492 ] Michael Malak commented on SPARK-1199: -- See also additional test cases in https

[jira] [Resolved] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems

2014-05-28 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak resolved SPARK-1836. -- Resolution: Duplicate REPL $outer type mismatch causes lookup() and equals() problems

Re: rdd ordering gets scrambled

2014-05-28 Thread Michael Malak
Mohit Jaggi: A workaround is to use zipWithIndex (to appear in Spark 1.0, but if you're still on 0.9.x you can swipe the code from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala ), map it to (x => (x._2,x._1)) and then sortByKey.
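The three steps of the workaround (attach an index, swap it into the key position, sort by key) can be modelled with plain Python lists. This is a sketch of the logic only, not Spark code: zip_with_index and restore_order are hypothetical helpers that mimic what zipWithIndex, the swapping map and sortByKey would do on an RDD.

```python
def zip_with_index(rows):
    """Pair each row with its original position, as (row, index)."""
    return [(row, index) for index, row in enumerate(rows)]


def restore_order(indexed_rows):
    """Recover the original order from (row, index) pairs in any order."""
    # Step 2 of the workaround: swap to (index, row) so the index is the key.
    swapped = [(index, row) for row, index in indexed_rows]
    # Step 3: sort by key, then drop the key again.
    return [row for _, row in sorted(swapped, key=lambda kv: kv[0])]
```

The index has to be attached while the order is still known, for example right after reading the file; once that is done, later operations may scramble the rows and restore_order can still put them back.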

[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-23 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007565#comment-14007565 ] Michael Malak commented on SPARK-1867: -- Thank you, sam, that fixed it for me! FYI, I

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Michael Malak
While developers may appreciate 1.0 == API stability, I'm not sure that will be the understanding of the VP who gives the green light to a Spark-based development effort. I fear a bug that silently produces erroneous results will be perceived like the FDIV bug, but in this case without the

[jira] [Updated] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems

2014-05-16 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak updated SPARK-1836: - Description: Anand Avati partially traced the cause to REPL wrapping classes in $outer classes

[jira] [Commented] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems

2014-05-16 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998807#comment-13998807 ] Michael Malak commented on SPARK-1836: -- Michael Ambrust: Indeed. Do you think I

[jira] [Created] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-13 Thread Michael Malak (JIRA)
Michael Malak created SPARK-1817: Summary: RDD zip erroneous when partitions do not divide RDD count Key: SPARK-1817 URL: https://issues.apache.org/jira/browse/SPARK-1817 Project: Spark

Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Reposting here on dev since I didn't see a response on user: I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I substitute with isInstanceOf[]. I am using Spark

Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
:26 AM, Michael Malak michaelma...@yahoo.com wrote: Reposting here on dev since I didn't see a response on user: I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I

Re: Bug when zip with longs and too many partitions?

2014-05-12 Thread Michael Malak
as the ASF Jira system will let me reset my password. On Sunday, May 11, 2014 4:40 AM, Michael Malak michaelma...@yahoo.com wrote: Is this a bug? scala> sc.parallelize(1 to 2,4).zip(sc.parallelize(11 to 12,4)).collect res0: Array[(Int, Int)] = Array((1,11), (2,12)) scala> sc.parallelize(1L to 2L,4

Re: Opinions stratosphere

2014-05-02 Thread Michael Malak
looks like Spark outperforms Stratosphere fairly consistently in the experiments There was one exception the paper noted, which was when memory resources were constrained. In that case, Stratosphere seemed to have degraded more gracefully than Spark, but the author did not explore it deeper.

Re: UDFs with package names

2013-07-31 Thread Michael Malak
; Michael Malak michaelma...@yahoo.com Sent: Tuesday, July 30, 2013 7:06 PM Subject: Re: UDFs with package names It might be a better idea to use your own package com.mystuff.x. You might be running into an issue where java is not finding the file because it assumes the relation between package

UDFs with package names

2013-07-30 Thread Michael Malak
Thus far, I've been able to create Hive UDFs, but now I need to define them within a Java package name (as opposed to the default Java package as I had been doing), but once I do that, I'm no longer able to load them into Hive. First off, this works: add jar

Re: Best Performance on Large Scale Join

2013-07-29 Thread Michael Malak
Perhaps you can first create a temp table that contains only the records that will match?  See the UNION ALL trick at http://www.mail-archive.com/hive-user@hadoop.apache.org/msg01906.html From: Brad Ruderman bruder...@radiumone.com To: user@hive.apache.org

Re: Oracle to Hive

2013-07-10 Thread Michael Malak
Untested: SELECT a.c100, a.c300, b.c400   FROM t1 a   JOIN t2 b   ON a.c200 = b.c200   JOIN (SELECT DISTINCT a.c100           FROM t1 a2           JOIN t2 b2           ON a2.c200 = b2.c200         WHERE b2.c400 = SYSDATE - 1) a3   ON a.c100 = a3.c100   WHERE b.c400 = SYSDATE - 1    AND a.c300 = 0

Re: How Can I store the Hive query result in one file ?

2013-07-04 Thread Michael Malak
I have found that for output larger than a few GB, redirecting stdout results in an incomplete file.  For very large output, I do CREATE TABLE MYTABLE AS SELECT ... and then copy the resulting HDFS files directly out of  /user/hive/warehouse. From: Bertrand

Re: Fwd: Need urgent help in hive query

2013-06-28 Thread Michael Malak
Just copy and paste the whole long expressions to their second occurrences. From: dyuti a hadoop.hiv...@gmail.com To: user@hive.apache.org Sent: Friday, June 28, 2013 10:58 AM Subject: Fwd: Need urgent help in hive query Hi Experts, I'm trying with the

Re: INSERT non-static data to array?

2013-06-20 Thread Michael Malak
I've created https://issues.apache.org/jira/browse/HIVE-4771 to track this issue. - Original Message - From: Michael Malak michaelma...@yahoo.com To: user@hive.apache.org user@hive.apache.org Cc: Sent: Wednesday, June 19, 2013 2:35 PM Subject: Re: INSERT non-static data to array

Re: INSERT non-static data to array?

2013-06-20 Thread Michael Malak
. From: Edward Capriolo edlinuxg...@gmail.com To: user@hive.apache.org user@hive.apache.org; Michael Malak michaelma...@yahoo.com Sent: Thursday, June 20, 2013 9:15 PM Subject: Re: INSERT non-static data to array? i think you could select into as sub query and then use

[jira] [Created] (HIVE-4771) Support subqueries in INSERT for array types

2013-06-20 Thread Michael Malak (JIRA)
Michael Malak created HIVE-4771: --- Summary: Support subqueries in INSERT for array types Key: HIVE-4771 URL: https://issues.apache.org/jira/browse/HIVE-4771 Project: Hive Issue Type

Re: INSERT non-static data to array?

2013-06-19 Thread Michael Malak
[]); INSERT INTO table_a   SELECT a, b, ARRAY(SELECT c FROM table_c WHERE table_c.parent = table_b.id)   FROM table_b From: Edward Capriolo edlinuxg...@gmail.com To: user@hive.apache.org user@hive.apache.org; Michael Malak michaelma...@yahoo.com Sent: Wednesday

Re: Hive QL - NOT IN, NOT EXIST

2013-05-05 Thread Michael Malak
--- On Sun, 5/5/13, Peter Chu pete@outlook.com wrote: I am wondering if there is any way to do this without resorting to using left outer join and finding nulls. I have found this to be an acceptable substitute. Is it not working for you?
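The left-outer-join-and-find-nulls substitute for NOT IN can be tried out with Python's built-in sqlite3 module. This sketch runs on SQLite rather than Hive, and the table names a and b, their columns and the helper rows_not_in are hypothetical; the shape of the query (LEFT OUTER JOIN plus an IS NULL filter on the right side's key) is the part that carries over.

```python
import sqlite3


def rows_not_in(left_rows, right_keys):
    """Return rows of table a whose id does not appear in table b."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE b (id INTEGER)")
    conn.executemany("INSERT INTO a VALUES (?, ?)", left_rows)
    conn.executemany("INSERT INTO b VALUES (?)", [(k,) for k in right_keys])
    # Unmatched rows of a get NULLs for b's columns, so filtering on
    # b.id IS NULL keeps exactly the rows that have no partner in b.
    result = conn.execute(
        "SELECT a.id, a.name "
        "FROM a LEFT OUTER JOIN b ON a.id = b.id "
        "WHERE b.id IS NULL "
        "ORDER BY a.id"
    ).fetchall()
    conn.close()
    return result
```

One caveat of this form: if b contains duplicate ids the result is unaffected, because only rows of a that found no match at all survive the IS NULL filter.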

[jira] [Commented] (HIVE-4022) Structs and struct fields cannot be NULL in INSERT statements

2013-02-20 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13582662#comment-13582662 ] Michael Malak commented on HIVE-4022: - Note that there is a workaround for the case

[jira] [Commented] (HIVE-3528) Avro SerDe doesn't handle serializing Nullable types that require access to a Schema

2013-02-20 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13582664#comment-13582664 ] Michael Malak commented on HIVE-3528: - As noted in the first comment from https

Re: NULLable STRUCTs

2013-02-19 Thread Michael Malak
If no one has any objection, I'm going to update HIVE-4022, which I entered a week ago when I thought the behavior was Avro-specific, to indicate it actually affects even native Hive tables. https://issues.apache.org/jira/browse/HIVE-4022 --- On Fri, 2/15/13, Michael Malak michaelma

[jira] [Updated] (HIVE-4022) Structs and struct fields cannot be NULL in INSERT statements

2013-02-19 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak updated HIVE-4022: Description: Originally thought to be Avro-specific, and first noted with respect to HIVE-3528

NULLable STRUCTs

2013-02-15 Thread Michael Malak
It seems that all Hive columns (at least those of primitive types) are always NULLable? What about columns of type STRUCT? The following: echo 1,2 >twovalues.csv hive> CREATE TABLE tc (x INT, y INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA LOCAL INPATH 'twovalues.csv' INTO TABLE

[jira] [Commented] (HIVE-3528) Avro SerDe doesn't handle serializing Nullable types that require access to a Schema

2013-02-14 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578523#comment-13578523 ] Michael Malak commented on HIVE-3528: - I've tried the latest Avro SerDe from GitHub

[jira] [Commented] (HIVE-3528) Avro SerDe doesn't handle serializing Nullable types that require access to a Schema

2013-02-14 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578538#comment-13578538 ] Michael Malak commented on HIVE-3528: - Sean: I mean https://github.com/apache/hive

[jira] [Commented] (HIVE-3528) Avro SerDe doesn't handle serializing Nullable types that require access to a Schema

2013-02-14 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578783#comment-13578783 ] Michael Malak commented on HIVE-3528: - Sean: OK, I've researched the problem further

[jira] [Created] (HIVE-4022) Avro SerDe queries don't handle hard-coded nulls for optional/nullable structs

2013-02-14 Thread Michael Malak (JIRA)
Michael Malak created HIVE-4022: --- Summary: Avro SerDe queries don't handle hard-coded nulls for optional/nullable structs Key: HIVE-4022 URL: https://issues.apache.org/jira/browse/HIVE-4022 Project

Re: INSERT INTO table with STRUCT, SELECT FROM

2013-02-13 Thread Michael Malak
any files already there. I'm paranoid, so I would write to a different directory and then move the files over... dean On Wed, Feb 13, 2013 at 1:26 PM, Michael Malak michaelma...@yahoo.com wrote: Is it possible to INSERT INTO TABLE t SELECT FROM where t has a column with a STRUCT? Based

[jira] [Commented] (AVRO-1035) Add the possibility to append to existing avro files

2013-02-07 Thread Michael Malak (JIRA)
[ https://issues.apache.org/jira/browse/AVRO-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13573711#comment-13573711 ] Michael Malak commented on AVRO-1035: - ha...@cloudera.com has provided example code

Re: Is it possible to append to an already existing avro file

2013-02-07 Thread Michael Malak
take it as append-able? I don't think its possible for Avro to carry it since Avro (core) does not reverse-depend on Hadoop. Should we document it somewhere though? Do you have any ideas on the best place to do that? On Thu, Feb 7, 2013 at 6:12 AM, Michael Malak michaelma...@yahoo.com

Re: Is it possible to append to an already existing avro file

2013-02-06 Thread Michael Malak
in the API: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataOutputStream.html. Here's a sample program that works on Hadoop 2.x in my tests: https://gist.github.com/QwertyManiac/4724582 On Wed, Feb 6, 2013 at 9:00 AM, Michael Malak michaelma...@yahoo.com wrote: I

Re: Is it possible to append to an already existing avro file

2013-02-05 Thread Michael Malak
changes in Avro. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#append(org.apache.hadoop.fs.Path) I have no recent personal experience with append in HDFS.  Does anyone else here? Doug On Tue, Feb 5, 2013 at 4:10 PM, Michael Malak michaelma...@yahoo.com wrote

Hard-coded inline relations

2013-01-18 Thread Michael Malak
I'm new to Pig, and it looks like there is no provision to declare relations inline in a Pig script (without LOADing from an external file)? Based on http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Constants I would have thought the following would constitute Hello World for Pig: A =