Re: Hive Table not found from Spark SQL

2015-03-26 Thread ๏̯͡๏
Hello Michael, thanks for your time. 1. "show tables" from the Spark program returns nothing. 2. What entities are you talking about? (I am actually new to Hive as well.) On Thu, Mar 26, 2015 at 8:35 PM, Michael Armbrust wrote: > What does "show tables" return? You can also run "SET " to > make sur

[Spark Streaming] Disk not being cleaned up during runtime after RDD being processed

2015-03-26 Thread Nathan Marin
Hi, I’ve been trying to use Spark Streaming for my real-time analysis application using the Kafka Stream API on a cluster (the YARN version) of 6 executors with 4 dedicated cores and 8192 MB of dedicated RAM. The thing is, my application should run 24/7, but the disk usage is leaking. This le

Re: Hive Table not found from Spark SQL

2015-03-26 Thread Michael Armbrust
What does "show tables" return? You can also run "SET " to make sure that entries from your hive-site are being read correctly. On Thu, Mar 26, 2015 at 4:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I have a table dw_bid that is created in Hive and has nothing to do with > Spark. I have data in Avro that I wan

Re: Missing an output location for shuffle. : (

2015-03-26 Thread Michael Armbrust
I would suggest looking for errors in the logs of your executors. On Thu, Mar 26, 2015 at 3:20 AM, 李铖 wrote: > Again, when I run a Spark SQL query over a larger file, an error occurs. Has > anyone fixed it? Please help me. > Here is the stack trace. > > org.apache.spark.shuffle.MetadataFetchFailedException: Miss

Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-26 Thread Ravi Mody
After upgrading to 1.3.0, ALS.trainImplicit() has been returning vastly smaller factors (and hence scores). For example, the first few product's factor values in 1.2.0 are (0.04821, -0.00674, -0.0325). In 1.3.0, the first few factor values are (2.535456E-8, 1.690301E-8, 6.99245E-8). This differenc

Re: Spark log shows only this line repeated: RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time X

2015-03-26 Thread Ted Yu
It is logged from RecurringTimer#loop():

    private def loop() {
      try {
        while (!stopped) {
          clock.waitTillTime(nextTime)
          callback(nextTime)
          prevTime = nextTime
          nextTime += period
          logDebug("Callback for " + name + " called at time " + prevTime)
        }

Re: RDD equivalent of HBase Scan

2015-03-26 Thread Sean Owen
An RDD is a very different creature than a NoSQL store, so I would not think of them as in the same ball-park for NoSQL-like workloads. It's not built for point queries or range scans, since any request would launch a distributed job to scan all partitions. It's not something built for, say, thousa

RDD Exception Handling

2015-03-26 Thread Kevin Conaway
How can we catch exceptions that are thrown from custom RDDs or custom map functions? We have a custom RDD that is throwing an exception that we would like to catch, but the exception thrown back to the caller is an *org.apache.spark.SparkException* that does not contain any useful informat
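
A common workaround, as a minimal sketch (not from this thread; the input path and parsing logic are illustrative), is to wrap the record-level logic in scala.util.Try so the per-record causes survive back to the driver instead of being buried in a generic SparkException:

    import scala.util.{Try, Success, Failure}

    // Each record is parsed by custom logic that may throw.
    val parsed = sc.textFile("hdfs:///input").map(line => Try(line.toInt))

    // Split successes from failures so the driver can inspect the causes.
    val failures = parsed.collect { case Failure(e) => e.toString }.take(10)
    val values   = parsed.collect { case Success(v) => v }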

Recreating the Mesos/Spark paper's experiments

2015-03-26 Thread Hans van den Bogert
Hi all, for my master's thesis I will be characterising the performance of two-level schedulers like Mesos. After reading the paper https://www.cs.berkeley.edu/~alig/papers/mesos.pdf, where Spark is also introduced, I am wondering how some experiments and results came about. If this is not the pl

Re: Which RDD operations preserve ordering?

2015-03-26 Thread Ted Yu
This is related: https://issues.apache.org/jira/browse/SPARK-6340 On Thu, Mar 26, 2015 at 5:58 AM, sergunok wrote: > Hi guys, > > I don't have exact picture about preserving of ordering of elements of RDD > after executing of operations. > > Which operations preserve it? > 1) Map (Yes?) > 2) Zip

Re: RDD equivalent of HBase Scan

2015-03-26 Thread Stuart Layton
Thanks, but I'm hoping to get away from HBase altogether. I was wondering if there is a way to get similar scan performance directly on cached RDDs or DataFrames. On Thu, Mar 26, 2015 at 9:54 AM, Ted Yu wrote: > In examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala, > TableInpu

Spark log shows only this line repeated: RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time X

2015-03-26 Thread Adrian Mocanu
Here's my log output from a streaming job. What is this?

    09:54:27.504 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067504
    09:54:27.505 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Call

Re: RDD equivalent of HBase Scan

2015-03-26 Thread Ted Yu
In examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala, TableInputFormat is used. TableInputFormat accepts the parameter

    public static final String SCAN = "hbase.mapreduce.scan";

where, if specified, a Scan object would be created from its String form:

    if (conf.get(SCAN) != null) {
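
A minimal sketch of what this looks like end to end (assuming the HBase client jars are on the classpath; the table name and row range are illustrative):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
    import org.apache.hadoop.hbase.util.Bytes

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "mytable")

    // Serialize a Scan (with start/stop rows) into the SCAN property
    // so the input format seeks instead of reading the whole table.
    val scan = new Scan(Bytes.toBytes("row-000"), Bytes.toBytes("row-999"))
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])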

RDD equivalent of HBase Scan

2015-03-26 Thread Stuart Layton
HBase scans come with the ability to specify filters that make scans very fast and efficient (as they let you seek to the keys that pass the filter). Do RDDs or Spark DataFrames offer anything similar, or would I be required to use a NoSQL DB like HBase to do something like this? -- Stuart Layt
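
For comparison, the closest DataFrame analogue is a filter over cached data (a sketch; the path and column names are illustrative). Note this prunes rows but, unlike an HBase scan, cannot seek past data it never reads:

    // Hypothetical cached DataFrame with a 'key' column.
    val df = sqlContext.parquetFile("/data/events").cache()
    df.filter(df("key") >= "row-000" && df("key") <= "row-999").count()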

Re: python : Out of memory: Kill process

2015-03-26 Thread Eduardo Cusa
Hi Davies, I upgraded to 1.3.0 and am still getting Out of Memory. I ran the same code as before; do I need to make any changes? On Wed, Mar 25, 2015 at 4:00 PM, Davies Liu wrote: > With batchSize = 1, I think it will become even worse. > > I'd suggest to go with 1.3, have a taste for the new Dat

Populating a HashMap from a GraphX connectedComponents graph

2015-03-26 Thread Bob DuCharme
The Scala code below was based on https://www.sics.se/~amir/files/download/dic/answers6.pdf. I extended it by adding a HashMap called componentLists, which I populated with each component's starting node as the key and a ListBuffer of the component's members as the value. As the output below the code sho
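
A distributed alternative to a driver-side HashMap, as a sketch under the same setup (graph is assumed to be the GraphX graph built earlier), is to group the connectedComponents output by component id:

    // connectedComponents() labels each vertex with the smallest vertex id
    // in its component; grouping by that label yields the member lists.
    val members = graph.connectedComponents().vertices
      .map { case (vertexId, componentId) => (componentId, vertexId) }
      .groupByKey()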

Re: Spark-1.3.0 UI shows 0 cores in completed applications tab

2015-03-26 Thread Sean Owen
https://issues.apache.org/jira/browse/SPARK-5771 ? On Thu, Mar 26, 2015 at 12:58 PM, MEETHU MATHEW wrote: > Hi all, > > I started spark-shell in spark-1.3.0 and did some actions. The UI was > showing 8 cores under the running applications tab. But when I exited the > spark-shell using exit, the a

Spark-1.3.0 UI shows 0 cores in completed applications tab

2015-03-26 Thread MEETHU MATHEW
Hi all, I started spark-shell in spark-1.3.0 and did some actions. The UI was showing 8 cores under the running applications tab. But when I exited the spark-shell using exit, the application moved to the completed applications tab and the number of cores is 0. Again when I exited the spark-shell

Which RDD operations preserve ordering?

2015-03-26 Thread sergunok
Hi guys, I don't have a clear picture of how the ordering of RDD elements is preserved after operations are executed. Which operations preserve it? 1) map (Yes?) 2) zipWithIndex (Yes, or sometimes yes?) Serg. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Wh
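
As a quick illustration (not from the thread): a plain map keeps the partition-local order of its input, while zipWithIndex assigns indices from the current partition order, so its result is deterministic only when the upstream ordering is. A shuffle discards ordering:

    val rdd = sc.parallelize(1 to 10, 2)
    rdd.map(_ * 2).collect()      // element order is preserved
    rdd.zipWithIndex().collect()  // indices follow the partition order
    rdd.repartition(4).collect()  // shuffles, so ordering is not preserved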

Re: Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
I was looking for some options and came across JethroData. http://www.jethrodata.com/ It stores the data while maintaining indexes over all the columns, which seems good, and it claims to have better performance than Impala. Earlier I had tried Apache Phoenix because of its secondary indexing feature. But the

Re: Spark-core and guava

2015-03-26 Thread Sean Owen
This is a long and complicated story. In short, Spark shades Guava 14 except for a few classes that were accidentally used in a public API (Optional and a few more it depends on). So "provided" is more of a Maven workaround to achieve a desired effect. It's not "provided" in the usual sense. On Th
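
In practice this means an application that uses Guava itself should declare its own dependency rather than rely on Spark providing it; a sketch in sbt (the Guava version shown is simply the one from the thread):

    // build.sbt -- Spark marks Guava "provided", so depend on it explicitly.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
      "com.google.guava" %  "guava"      % "14.0.1"
    )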

Re: Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
I was looking for some options and came across http://www.jethrodata.com/ On Thu, Mar 26, 2015 at 5:47 PM, Jörn Franke wrote: > You can also preaggregate results for the queries by the user - depending > on what queries they use this might be necessary for any underlying > technology > On 26 Mar 2

Spark-core and guava

2015-03-26 Thread Stevo Slavić
Hello Apache Spark community, spark-core 1.3.0 has guava 14.0.1 as a provided dependency (see http://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.3.0/spark-core_2.10-1.3.0.pom ). What is supposed to provide Guava, and that specific version? Kind regards, Stevo Slavic.

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Jörn Franke
You can also preaggregate results for the queries by the user - depending on what queries they use, this might be necessary for any underlying technology. On 26 Mar 2015 11:27, "kundan kumar" wrote: > Hi, > > I need to store terabytes of data which will be used for BI tools like > qlikview. > >

Why does the executor encounter OutOfMemoryException: Java heap space

2015-03-26 Thread sergunok
Hi all, sometimes you see an "OutOfMemoryException: Java heap space" from an executor in Spark. There are many ideas about workarounds. My question is: how does an executor execute tasks from the point of view of memory usage and parallelism? The picture in my mind is: an executor is a JVM instance. Number
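
For reference, the knobs this question touches are set per executor (a sketch; the values are illustrative, not recommendations). Each task runs as a thread inside the executor JVM, one per core:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "4g")          // heap of each executor JVM
      .set("spark.executor.cores", "4")            // concurrent task threads per executor
      .set("spark.storage.memoryFraction", "0.6")  // share of heap reserved for cached RDDs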

Why does k-means clustering hang for a long time?

2015-03-26 Thread Xi Shen
Hi, when I run k-means clustering with Spark, these are the last two lines in the log:

    15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
    15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5

Then it hangs for a long time. There's no active job. The driver machine is i

Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Arush Kharbanda
You can look at the Spark SQL programming guide. http://spark.apache.org/docs/1.3.0/sql-programming-guide.html and the Spark API. http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.package On Thu, Mar 26, 2015 at 5:21 PM, Masf wrote: > Ok, > > Thanks. Some web resource whe

Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Masf
Ok, thanks. Is there some web resource where I can check the functionality supported by Spark SQL? Thanks!!! Regards, Miguel Ángel. On Thu, Mar 26, 2015 at 12:31 PM, Cheng Lian wrote: > We're working together with AsiaInfo on this. Possibly will deliver an > initial version of window function suppo

Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Cheng Lian
We're working together with AsiaInfo on this. Possibly will deliver an initial version of window function support in 1.4.0. But it's not a promise yet. Cheng On 3/26/15 7:27 PM, Arush Kharbanda wrote: Its not yet implemented. https://issues.apache.org/jira/browse/SPARK-1442 On Thu, Mar 26,

Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Arush Kharbanda
Its not yet implemented. https://issues.apache.org/jira/browse/SPARK-1442 On Thu, Mar 26, 2015 at 4:39 PM, Masf wrote: > Hi. > > Are the Windowing and Analytics functions supported in Spark SQL (with > HiveContext or not)? For example in Hive is supported > https://cwiki.apache.org/confluence/d

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Cheng Lian
I couldn’t reproduce this with the following spark-shell snippet:

    scala> import sqlContext.implicits._
    scala> Seq((1, 2)).toDF("a", "b")
    scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
    scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

The _common_metadata file

Re: Write Parquet File with spark-streaming with Spark 1.3

2015-03-26 Thread Cheng Lian
You may resort to the generic save API introduced in 1.3, which supports appending as long as the target data source supports it. And in 1.3, Parquet does support appending. Cheng On 3/26/15 4:13 PM, Richard Grossman wrote: Hi I've succeed to write kafka stream to parquet file in Spark 1.2 b
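
A minimal sketch of the generic save API applied to a stream (the stream and path names are illustrative; the records are assumed to be case-class instances so a DataFrame can be derived):

    import org.apache.spark.sql.SaveMode

    // Append each micro-batch to the same Parquet directory.
    kafkaStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val df = sqlContext.createDataFrame(rdd)
        df.save("hdfs:///data/events.parquet", "parquet", SaveMode.Append)
      }
    }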

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Akhil Das
Yes, you can easily configure Spark Thrift server and connect BI Tools. Here's an example showing how to integrate SparkSQL with Tableau dashboards. Thanks Best Regards On Thu, Mar 26, 2015 at 3:56 PM, kundan ku

Re: How to deploy binary dependencies to workers?

2015-03-26 Thread Xi Shen
OK, after various testing, I found the native library can be loaded if running in yarn-cluster mode. But I still cannot find out why it won't load when running in yarn-client mode... Thanks, David On Thu, Mar 26, 2015 at 4:21 PM Xi Shen wrote: > Not of course...all machines in HDInsight are W

Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Masf
Hi. Are the Windowing and Analytics functions supported in Spark SQL (with HiveContext or not)? For example in Hive is supported https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Some tutorial or documentation where I can see all features supported by Spark SQ

Re: Hive Table not found from Spark SQL

2015-03-26 Thread ๏̯͡๏
I have a table dw_bid that is created in Hive and has nothing to do with Spark. I have data in Avro that I want to join with the dw_bid table; this join needs to be done using Spark SQL. However, for some reason Spark says the dw_bid table does not exist. How do I tell Spark that dw_bid is a table created i

Re: Hive Table not found from Spark SQL

2015-03-26 Thread ๏̯͡๏
I have this query:

    insert overwrite table sojsuccessevents2_spark
    select guid, sessionKey, sessionStartDate, sojDataDate, seqNum, eventTimestamp, siteId, successEventType, sourceType, itemId,
      shopCartId, b.transaction_Id as transactionId, offerId, b.bdr_id as userId, priorPage1SeqNum, priorPage1PageId, exclWMSea

Re: Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
I've filed this as https://issues.apache.org/jira/browse/SPARK-6554 On Thu, Mar 26, 2015 at 6:29 AM, Jon Chase wrote: > Spark 1.3.0, Parquet > > I'm having trouble referencing partition columns in my queries. > > In the following example, 'probeTypeId' is a partition column. For > example, the

Port configuration for BlockManagerId

2015-03-26 Thread Manish Gupta 8
Hi, I am running spark-shell and connecting to a YARN cluster with deploy mode "client". In our environment, some security policies don't allow us to open all TCP ports. The issue I am facing is: the Spark Shell driver is using a random port for BlockManagerId - BlockManagerId(, ho
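
If it helps, Spark exposes properties to pin these ports instead of picking random ones (a sketch; the port values are illustrative and should be checked against your firewall rules):

    // Fix the block manager and driver ports so only known TCP ports are used.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.blockManager.port", "51000")
      .set("spark.driver.port", "51001")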

Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
Spark 1.3.0, Parquet

I'm having trouble referencing partition columns in my queries. In the following example, 'probeTypeId' is a partition column. For example, the directory structure looks like this:

    /mydata
      /probeTypeId=1
        ...files...
      /probeTypeId=2
        ...files...

I see

Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
Hi, I need to store terabytes of data which will be used for BI tools like qlikview. The queries can be on the basis of a filter on any column. Currently we are using Redshift for this purpose, and I am trying to explore things other than Redshift. Is it possible to gain better performance in

Missing an output location for shuffle. : (

2015-03-26 Thread 李铖
Again, when I run a Spark SQL query over a larger file, an error occurs. Has anyone fixed it? Please help me. Here is the stack trace.

    org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$con

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-26 Thread Steve Loughran
On 25 Mar 2015, at 21:54, roni <roni.epi...@gmail.com> wrote: Is there any way that I can install the new one and remove the previous version? I installed Spark 1.3 on my EC2 master and set the Spark home to the new one. But when I start the spark-shell I get - java.lang.UnsatisfiedLinkError

Re: OOM for HiveFromSpark example

2015-03-26 Thread Akhil Das
Could you try putting that file in HDFS and loading it like:

    LOAD DATA INPATH 'hdfs://sigmoid/test/kv1.txt' INTO TABLE src_spark

Thanks Best Regards On Thu, Mar 26, 2015 at 2:07 PM, Akhil Das wrote: > When you run it in local mode ^^ > > Thanks > Best Regards > > On Thu, Mar 26, 2015 at 2:06 PM, ÐΞ€ρ

Re: OOM for HiveFromSpark example

2015-03-26 Thread Akhil Das
When you run it in local mode ^^ Thanks Best Regards On Thu, Mar 26, 2015 at 2:06 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I don;t think thats correct. load data local should pick input from local > directory. > > On Thu, Mar 26, 2015 at 1:59 PM, Akhil Das > wrote: > >> Not sure, but you can create that pat

Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how?

2015-03-26 Thread ๏̯͡๏
Hello Shahab, are you able to read tables created in Hive from Spark SQL? If yes, how are you referring to them? On Thu, Mar 26, 2015 at 1:11 PM, Takeshi Yamamuro wrote: > I think it is not `sqlContext` but hiveContext because `create temporary > function` is not supported in SQLContext. > > On

Re: OOM for HiveFromSpark example

2015-03-26 Thread ๏̯͡๏
I don't think that's correct. LOAD DATA LOCAL should pick input from a local directory. On Thu, Mar 26, 2015 at 1:59 PM, Akhil Das wrote: > Not sure, but you can create that path in all workers and put that file in > it. > > Thanks > Best Regards > > On Thu, Mar 26, 2015 at 1:56 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)

Re: OOM for HiveFromSpark example

2015-03-26 Thread Akhil Das
Not sure, but you can create that path in all workers and put that file in it. Thanks Best Regards On Thu, Mar 26, 2015 at 1:56 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > The Hive command > > LOAD DATA LOCAL INPATH > '/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/examples/src/main/resources/kv1.txt' > IN

Re: OOM for HiveFromSpark example

2015-03-26 Thread ๏̯͡๏
The Hive command:

    LOAD DATA LOCAL INPATH '/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/examples/src/main/resources/kv1.txt' INTO TABLE src_spark

1. LOCAL INPATH: if I push the file to HDFS, how will it work? 2. I can't use sc.addFile, because I want to run Hive (Spark SQL) queries. On Thu, Mar 2

Write Parquet File with spark-streaming with Spark 1.3

2015-03-26 Thread Richard Grossman
Hi, I've succeeded in writing a Kafka stream to a Parquet file in Spark 1.2, but I can't make it work with Spark 1.3. In streaming I can't use saveAsParquetFile() because I can't add data to an existing Parquet file. I know that it's possible to stream data directly into Parquet; could you help me by providing

Re: OOM for HiveFromSpark example

2015-03-26 Thread Akhil Das
Now it's clear that the workers do not have the file kv1.txt in their local filesystem. You can try putting it in HDFS and using the URI to that file, or try adding the file with sc.addFile. Thanks Best Regards On Thu, Mar 26, 2015 at 1:38 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Does not work > > 15/03/26 0

Re: OOM for HiveFromSpark example

2015-03-26 Thread ๏̯͡๏
Does not work:

    15/03/26 01:07:05 INFO HiveMetaStore.audit: ugi=dvasthimal ip=unknown-ip-addr cmd=get_table : db=default tbl=src_spark
    15/03/26 01:07:06 ERROR ql.Driver: FAILED: SemanticException Line 1:23 Invalid path ''/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/examples/src/main/resources

Re: Unable to Hive program from Spark Programming Guide (OutOfMemoryError)

2015-03-26 Thread ๏̯͡๏
Resolved. The bold portion (formatting lost here) marked the fix:

    ./bin/spark-submit -v --master yarn-cluster --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.j

Hive Table not found from Spark SQL

2015-03-26 Thread ๏̯͡๏
I have a Hive table named dw_bid; when I run hive from the command prompt and run describe dw_bid, it works. I want to join an Avro file (table) in HDFS with this Hive dw_bid table, and I refer to it as dw_bid from the Spark SQL program; however I see 15/03/26 00:31:01 INFO HiveMetaStore.audit: ugi=dvasthimal

Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how?

2015-03-26 Thread Takeshi Yamamuro
I think it is not `sqlContext` but hiveContext because `create temporary function` is not supported in SQLContext. On Wed, Mar 25, 2015 at 5:58 AM, Jon Chase wrote: > Shahab - > > This should do the trick until Hao's changes are out: > > > sqlContext.sql("create temporary function foobar as > 'c
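
For reference, the registration Jon refers to looks like this through a HiveContext (a sketch; the function, class, and table names are illustrative):

    // Register a Hive UDAF class as a temporary function, then call it in SQL.
    hiveContext.sql("create temporary function my_stddev as 'com.example.hive.udaf.StdDev'")
    hiveContext.sql("select region, my_stddev(price) from sales group by region")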

Can I call aggregate UDF in DataFrame?

2015-03-26 Thread Haopu Wang
Specifically, there are only 5 aggregate functions in class org.apache.spark.sql.GroupedData: sum/max/min/mean/count. Can I plug in a function to calculate stddev? Thank you!
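
Until such a hook exists, stddev can be assembled from the built-in aggregates using Var(x) = E[x^2] - E[x]^2 (a sketch; df, key, and x are illustrative, and the expression syntax should be verified against your Spark version):

    import org.apache.spark.sql.functions._

    // Aggregate sum(x), sum(x^2) and count(x), then finish the algebra per group.
    val agg = df.groupBy("key").agg(
      sum(df("x")).as("s"),
      sum(df("x") * df("x")).as("ss"),
      count(df("x")).as("n"))
    val withStddev = agg.selectExpr("key", "sqrt(ss / n - (s / n) * (s / n)) as stddev")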

Re: How to troubleshoot server.TransportChannelHandler Exception

2015-03-26 Thread Xi Shen
ah~hell, I am using Spark 1.2.0, and my job was submitted to use 8 cores... the magic number in the bug. Xi Shen http://about.me/davidshen On Thu, Mar 26, 2015 at 5:48 PM, Akhil Das wrote: > Whats y

Re: Spark-sql query got exception.Help

2015-03-26 Thread 李铖
Yes, the exception occurred sometimes, but in the end the final result came out. 2015-03-26 11:08 GMT+08:00 Saisai Shao: > Would you mind running again to see if this exception can be reproduced > again, since exceptions in MapOutputTracker seldom occur; maybe some other > exceptions which lead to t

Re: writing DStream RDDs to the same file

2015-03-26 Thread Akhil Das
Here's something similar which I used to do:

    unionDStream.foreachRDD(rdd => {
      val events = rdd.count()
      println("Received Events : " + rdd.count())
      if (events > 0) {
        val fw = new FileWriter("events", true)
        fw.write(Calendar.getInstance().getTime + "," + events + "\n")
        fw.close()
      }
    })

Sending from ce

Re: Can LBFGS be used on streaming data?

2015-03-26 Thread EcoMotto Inc.
Hello DB, thank you! Do you know how to run Linear Regression without SGD on streaming data in Spark? I tried SGD, but due to the step size I was not getting the expected weights. Best Regards, Arunkumar On Wed, Mar 25, 2015 at 4:33 PM, DB Tsai wrote: > Hi Arunkumar, > > I think L-BFGS will not wor
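
For reference, MLlib's streaming regressor exposes the step size explicitly, which is usually the first thing to tune when the weights diverge (a sketch; trainingStream and numFeatures are illustrative):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    // trainingStream: DStream[LabeledPoint] built elsewhere.
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))
      .setStepSize(0.01)     // shrink if the weights blow up
      .setNumIterations(50)
    model.trainOn(trainingStream)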
