Re: How to do map join in Spark SQL

2015-12-19 Thread Alexander Pivovarov
I collected the small DF to an array of Tuple3, then registered a UDF that does a lookup in the array, then just ran a select that uses the UDF. On Dec 18, 2015 1:06 AM, "Akhil Das" wrote: > You can broadcast your json data and then do a map-side join. This
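A minimal sketch of the pattern Alexander describes (table and column names are illustrative; assumes a spark-shell session where sc and sqlContext are in scope):

    // Collect the small side to the driver and broadcast it as a lookup map.
    val small = sqlContext.table("small_table")            // e.g. (id, name)
    val lookup = sc.broadcast(
      small.collect().map(r => (r.getInt(0), r.getString(1))).toMap)

    // A UDF that probes the broadcast map turns the join into a map-side lookup.
    sqlContext.udf.register("lookupName",
      (id: Int) => lookup.value.get(id).orNull)

    val joined = sqlContext.sql("SELECT id, lookupName(id) AS name FROM big_table")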

spark 1.5.2 memory leak? reading JSON

2015-12-19 Thread Eran Witkon
Hi, I tried the following code in spark-shell on spark 1.5.2: val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/sample2.json"); df.count() 15/12/19 23:49:40 ERROR Executor: Managed memory leak detected; size = 67108864 bytes, TID = 3 15/12/19 23:49:40 ERROR Executor: Exception

Re: spark 1.5.2 memory leak? reading JSON

2015-12-19 Thread Ted Yu
The 'Failed to parse a value' error was the cause of the execution failure. Can you disclose the structure of your json file? Maybe try the latest 1.6.0 RC to see if the problem goes away. Thanks On Sat, Dec 19, 2015 at 1:55 PM, Eran Witkon wrote: > Hi, > I tried the following code

Pyspark SQL Join Failure

2015-12-19 Thread Weiwei Zhang
Hi all, I got this error when I tried to use the 'join' function to left outer join two data frames in pyspark 1.4.1. Please kindly point out the places where I made mistakes. Thank you. Traceback (most recent call last): File "/Users/wz/PycharmProjects/PysparkTraining/Airbnb/src/driver.py",
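For reference, a minimal left outer join on DataFrames (shown in Scala; the PySpark 1.4 API mirrors it, with how='left_outer'). The two frames and the shared "id" column are hypothetical:

    // Rows from listings with no match in hosts come back with nulls on the right.
    val joined = listings.join(hosts, listings("id") === hosts("id"), "left_outer")
    joined.show()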

Re: Kafka - streaming from multiple topics

2015-12-19 Thread Neelesh
A related issue - when I put multiple topics in a single stream, the processing delay is as bad as the slowest task among the tasks created. Even though the topics are unrelated to each other, the RDD at time "t1" has to wait until the RDD at "t0" is fully executed, even if most cores are
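One commonly suggested workaround (not from this thread's replies): give each topic its own direct stream so one slow topic does not gate the others. A sketch against the 1.x Kafka API, with broker and topic names illustrative and ssc assumed in scope; whether the resulting jobs actually run concurrently also depends on spark.streaming.concurrentJobs:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val streams = Seq("topicA", "topicB").map { topic =>
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set(topic))
    }
    // Process each topic's stream independently.
    streams.foreach(_.foreachRDD { rdd => println(rdd.count()) })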

How to map a HashMap containing vertex as key and edge as values into Spark RDD

2015-12-19 Thread aparasur
After parsing the unstructured contents from a file, I now have a HashMap of key/value pairs where the keys represent vertices and the values represent edges. Note how, using the edges, the vertices can be connected by joining the values. I am now trying to build Spark vertex and
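A sketch under an assumed shape - a Map[String, String] on the driver (hashMap) whose keys are vertex names and whose values are edge labels, with two vertices connected when they share a label:

    import org.apache.spark.graphx.{Edge, Graph}

    val pairs = sc.parallelize(hashMap.toSeq)          // (vertexName, edgeLabel)
    val ids = pairs.keys.distinct().zipWithUniqueId()  // (vertexName, vertexId)
    val labelToId = pairs.join(ids).map { case (_, (label, id)) => (label, id) }
    // Self-join on the label to connect vertices that share it.
    val edges = labelToId.join(labelToId)
      .collect { case (label, (a, b)) if a < b => Edge(a, b, label) }
    val graph = Graph(ids.map(_.swap), edges)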

Hive error when starting up spark-shell in 1.5.2

2015-12-19 Thread Marco Mistroni
Hi all, posting this again as I was experiencing this error under 1.5.1 as well. I am running Spark 1.5.2 on a Windows 10 laptop (upgraded from Windows 8). When I launch spark-shell I get this exception, presumably because I have no admin rights to the /tmp directory on my laptop (Windows 8-10 seems

Re: Is DataFrame.groupBy supposed to preserve order within groups?

2015-12-19 Thread Timothée Carayol
Thanks Michael. If I understand correctly, this is the expected behaviour then, and there is no order guarantee within groups of a grouped DataFrame. I'll comment on that blog post to report that its message is inaccurate. Your first suggestion is to use window functions. As I understand, window functions
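For reference, a sketch of the window-function route in 1.5 (requires a HiveContext; "k" and "ts" are illustrative column names): impose an explicit order within each group instead of relying on groupBy:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.rowNumber

    // Number the rows of each group by timestamp; downstream logic can then
    // rely on "rn" rather than on any incidental grouping order.
    val w = Window.partitionBy("k").orderBy("ts")
    val ordered = df.withColumn("rn", rowNumber().over(w))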

Re: Using Spark to process JSON with gzip filed

2015-12-19 Thread Eran Witkon
Thanks, since it is just a snippet, do you mean that Inflater is coming from ZLIB? Eran On Fri, Dec 18, 2015 at 11:37 AM Akhil Das wrote: > Something like this? This one uses the ZLIB compression, you can replace > the decompression logic with the GZip one in your case. >
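For what it's worth, Inflater lives in java.util.zip and handles raw ZLIB/DEFLATE data; for gzip-framed data the drop-in sibling is GZIPInputStream. A sketch, assuming the field arrives as raw bytes (if it is base64-encoded inside the JSON, decode it first):

    import java.io.ByteArrayInputStream
    import java.util.zip.GZIPInputStream
    import scala.io.Source

    // Decompress one gzip-compressed field into a UTF-8 string.
    def gunzip(bytes: Array[Byte]): String =
      Source.fromInputStream(
        new GZIPInputStream(new ByteArrayInputStream(bytes)), "UTF-8").mkString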

combining multiple JSON files to one DataFrame

2015-12-19 Thread Eran Witkon
Hi, Can I combine multiple JSON files into one DataFrame? I tried val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/*") but I get an empty DF Eran
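An empty DF here usually means the files are pretty-printed JSON; sqlContext.read.json expects one JSON object per line. Assuming line-delimited files, either a glob or a union works (paths illustrative):

    // Read all files matching the glob into one DataFrame.
    val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/*.json")

    // Or read them separately and union, provided the schemas line up.
    val all = sqlContext.read.json("/path/a.json")
      .unionAll(sqlContext.read.json("/path/b.json"))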

Fwd: Numpy and dynamic loading

2015-12-19 Thread Abhinav M Kulkarni
I am running Spark programs on a large cluster (for which I do not have administrative privileges). numpy is not installed on the worker nodes. Hence, I bundled numpy with my program, but I get the following error: Traceback (most recent call last): File "/home/user/spark-script.py", line 12,

Re: 101 question on external metastore

2015-12-19 Thread Deenar Toraskar
Apparently it is down to different versions of Derby on the classpath, but I am unsure where the other version is coming from. The setup worked perfectly with Spark 1.3.1. Deenar On 20 December 2015 at 04:41, Deenar Toraskar wrote: > Hi Yana/All > > I am getting the

Getting an error in insertion to mysql through sparkcontext in java..

2015-12-19 Thread Sree Eedupuganti
I had 9 rows in my MySQL table. options.put("dbtable", "(select * from employee"); options.put("lowerBound", "1"); options.put("upperBound", "8"); options.put("numPartitions", "2"); Error : Parameter index out of range (1 > number of parameters, which is 0) -- Best Regards,
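Two likely problems with the options as quoted: the subquery is never closed or aliased, and lowerBound/upperBound only take effect together with a numeric partitionColumn. A sketch of a corrected, partitioned JDBC read (URL and column name hypothetical):

    val df = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:mysql://host:3306/db?user=u&password=p",
      "dbtable"         -> "(select * from employee) as emp",
      "partitionColumn" -> "id",
      "lowerBound"      -> "1",
      "upperBound"      -> "8",
      "numPartitions"   -> "2")).load()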

TaskCompletionListener and Exceptions

2015-12-19 Thread Neelesh
Hi, I'm trying to build automatic Kafka watermark handling in my stream apps by overriding the KafkaRDDIterator, adding a TaskCompletionListener, and updating watermarks if the task completed (the iterator has access to offsets). But I found that there is no way to listen for a task error
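For context, a sketch of the completion-listener pattern inside a partition (rdd assumed in scope). In Spark 1.x the listener fires whether the task succeeded or failed and receives no error information, which is the gap Neelesh is running into; commitWatermark is a hypothetical stand-in for an external offset store:

    import org.apache.spark.TaskContext

    def commitWatermark(offset: Long): Unit = ()   // hypothetical stub

    rdd.foreachPartition { iter =>
      var last = -1L
      TaskContext.get().addTaskCompletionListener { _ => commitWatermark(last) }
      iter.foreach { _ => last += 1 }              // replace with the real Kafka offset
    }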

Re: 101 question on external metastore

2015-12-19 Thread Deenar Toraskar
Hi Yana/All I am getting the same exception. Did you make any progress? Deenar On 5 November 2015 at 17:32, Yana Kadiyska wrote: > Hi folks, trying experiment with a minimal external metastore. > > I am following the instructions here: >

Re: ALS predictAll does not generate all the user/item ratings

2015-12-19 Thread Nick Pentreath
Hi Roberto The method predictAll in PySpark calls the underlying method predict in Scala, which takes an RDD of (userId, productId) pairs. In other words, predictAll returns predicted scores for each pair in the input. This is exactly the output you see (i.e. 3 predictions for user 1, and 1
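In Scala MLlib the same method is predict, taking an RDD of (user, product) pairs. To score every user against every item rather than only the pairs you pass in, build the cross product first (model, users, and items assumed in scope):

    // users: RDD[Int], items: RDD[Int], model: MatrixFactorizationModel
    val allPairs = users.cartesian(items)   // every (user, item) combination
    val preds = model.predict(allPairs)     // one Rating per input pair

From 1.4 on, model.recommendProductsForUsers(k) is usually a cheaper way to get just the top-k items per user.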

Re: Spark batch getting hung up

2015-12-19 Thread Jeff Zhang
First you need to know where the hang happens (driver or executor); checking the logs would be helpful. On Sat, Dec 19, 2015 at 12:25 AM, SRK wrote: > Hi, > > My Spark batch job seems to hang sometimes for a long time before it > starts the next stage/exits. Basically it

About Huawei-Spark/Spark-SQL-on-HBase

2015-12-19 Thread censj
I use Huawei-Spark/Spark-SQL-on-HBase, but running ./bin/hbase-sql throws: 15/12/19 16:59:34 INFO storage.BlockManagerMaster: Registered BlockManager Exception in thread "main" java.lang.NoSuchMethodError: jline.Terminal.getTerminal()Ljline/Terminal; at

I coded an example to use Twitter stream as a data source for Spark

2015-12-19 Thread Amir Rahnama
Hi guys, Thought someone would need this: https://github.com/ambodi/realtime-spark-twitter-stream-mining You can use this approach to feed a Twitter stream to your Spark job. So far, PySpark does not have a Twitter DStream source. -- Thanks and Regards, Amir Hossein Rahnama Tel: +46 (0)

Re: About Huawei-Spark/Spark-SQL-on-HBase

2015-12-19 Thread censj
OK! But I think it is a jline version error. I found jline 0.9.94 in the pom.xml. > On 19 Dec 2015, at 17:29, Ravindra Pesala wrote: > > Hi censj, > > Please try the new repo at https://github.com/HuaweiBigData/astro > , not maintaining the old

Re: Dynamic jar loading

2015-12-19 Thread Jeff Zhang
Actually I would say yes and no. Yes, the jar will be fetched by the executors and added to their classpath; no, it will not be added to the driver's classpath. That means you cannot invoke the classes in the jar explicitly on the driver, but you can call them indirectly, like the following (or if the jar is only
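A sketch of that indirect route: after sc.addJar the classes are on the executors' classpath but not the driver's, so load them reflectively inside an executor-side closure. The class and method names are hypothetical, and transform is assumed to take an int and return a plain type so the result deserializes on the driver:

    sc.addJar("/path/to/extra.jar")
    val results = sc.parallelize(1 to 10).map { i =>
      // Runs on an executor, where the added jar is on the task classloader.
      val cls = Class.forName("com.example.Plugin",
        true, Thread.currentThread().getContextClassLoader)
      cls.getMethod("transform", classOf[Int]).invoke(cls.newInstance(), Int.box(i))
    }
    results.collect()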

Re: About Huawei-Spark/Spark-SQL-on-HBase

2015-12-19 Thread Ravindra Pesala
Hi censj, Please try the new repo at https://github.com/HuaweiBigData/astro ; we are not maintaining the old repo. Please let me know if you still get the error. You can also contact me at my personal mail, ravi.pes...@gmail.com Thanks, Ravindra. On Sat 19 Dec, 2015 2:45 pm censj

Re: Yarn application ID for Spark job on Yarn

2015-12-19 Thread Steve Loughran
On 18 Dec 2015, at 21:39, Andrew Or wrote: Hi Roy, I believe Spark just gets its application ID from YARN, so you can just do `sc.applicationId`. If you listen for a Spark start event you get the app ID, but not the real spark attempt
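A sketch of that event route (the application-start event carries the app ID, and in 1.5+ an attempt ID, as Options):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart}

    sc.addSparkListener(new SparkListener {
      override def onApplicationStart(start: SparkListenerApplicationStart): Unit =
        println(s"appId=${start.appId.getOrElse("n/a")}")
    })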

Re: how to fetch all of data from hbase table in spark java

2015-12-19 Thread Ted Yu
Please take a look at: examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala There are various HBase connectors (search for 'apache spark hbase connector'). In HBase 2.0, there will be an hbase-spark module which provides an HBase connector. FYI On Fri, Dec 18, 2015 at 11:56 PM, Sateesh
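The gist of that example is a full-table scan through the MapReduce input format ("mytable" is illustrative; sc assumed in scope):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "mytable")
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(rdd.count())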