Re: Reading from HBase is too slow

2014-09-30 Thread Tao Xiao
I checked the HBase UI. Well, this table is not completely evenly spread across the nodes, but I think to some extent it can be seen as nearly evenly spread - at least no single node has too many regions. Here is a screenshot of the HBase UI

Re: ExecutorLostFailure kills sparkcontext

2014-09-30 Thread Akhil Das
I also had a similar problem while joining a dataset. After digging into the worker logs I figured out it was throwing CancelledKeyException; not sure of the cause. Thanks Best Regards On Tue, Sep 30, 2014 at 5:15 AM, jamborta jambo...@gmail.com wrote: hi all, I have a problem with my application

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-30 Thread Haopu Wang
Hi, Liquan, thanks for the response. In your example, I think the hash table should be built on the right side, so Spark can iterate through the left side and find matches in the right side from the hash table efficiently. Please comment and suggest, thanks again!
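To make the trade-off being discussed concrete, here is a minimal sketch (plain Scala collections, not Spark's actual HashOuterJoin implementation) of a left outer hash join in which the hash table is built only on the right side while the left side is streamed through it:

```scala
// Sketch only: the build phase hashes the right side, the probe phase streams the left.
def leftOuterHashJoin[K, L, R](left: Seq[(K, L)],
                               right: Seq[(K, R)]): Seq[(K, (L, Option[R]))] = {
  // Build: one pass over the right side to create the lookup table.
  val rightTable: Map[K, Seq[R]] =
    right.groupBy(_._1).mapValues(_.map(_._2)).toMap

  // Probe: iterate through the left side and emit matches (or None).
  left.flatMap { case (k, l) =>
    rightTable.get(k) match {
      case Some(rs) => rs.map(r => (k, (l, Option(r))))
      case None     => Seq((k, (l, Option.empty[R])))
    }
  }
}
```

A full outer join is a different situation: rows on the build side that never matched must also be emitted, which is why a single streamed side no longer suffices on its own.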

SparkSQL DataType mappings

2014-09-30 Thread Costin Leau
Hi, I'm working on supporting SchemaRDD in Elasticsearch Hadoop [1] but I'm having some issues with the SQL API, in particular with what the DataTypes translate to. 1. A SchemaRDD is composed of a Row and StructType - I'm using the latter to decompose a Row into primitives. I'm not clear

Spark Streaming for time consuming job

2014-09-30 Thread Eko Susilo
Hi All, I have a problem regarding Spark Streaming that I would like to consult about. I have a Spark Streaming application that parses a file (which keeps growing as time passes by). This file contains several columns containing lines of numbers; the parsing is divided into windows (each 1 minute). Each

Re: how to run spark job on yarn with jni lib?

2014-09-30 Thread taqilabon
We currently choose not to run jobs on yarn, so I've stopped trying this. Anyway, thanks for your suggestions. At least your solutions may help people who must run their jobs on yarn : )

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-30 Thread Liquan Pei
Hi Haopu, How about full outer join? One hash table may not be efficient for this case. Liquan On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang hw...@qilinsoft.com wrote: Hi, Liquan, thanks for the response. In your example, I think the hash table should be built on the right side, so

Getting errors in spark worker nodes

2014-09-30 Thread Murthy Chelankuri
I am new to Spark. I am trying to implement Spark Streaming from a Kafka topic. It worked fine for some time, but later it started throwing the error below. I am not getting any clue as to what is causing the issue. java.lang.Exception: Could not compute split, block

Re: Systematic error when re-starting Spark stream unless I delete all checkpoints

2014-09-30 Thread Svend Vanderveken
Hi again, Just FYI, I found the mistake in my code regarding restartability of Spark Streaming: I had a method providing a context (either retrieved from checkpoint or, if no checkpoint was available, built anew) and was building and then starting a stream on it. The mistake is that we should not build
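For anyone hitting the same symptom, a minimal sketch of the restart-safe pattern in the Spark 1.x streaming API: the whole DStream graph is defined inside the factory passed to StreamingContext.getOrCreate, so that on restart the graph is recovered from the checkpoint rather than being built on top of a recovered context (the checkpoint directory and source below are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // placeholder path

// Only invoked when no checkpoint exists; the DStream graph must be built here.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("restartable-stream")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
  lines.count().print()
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
ssc.start()
ssc.awaitTermination()
```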

Parallel spark jobs on mesos cluster

2014-09-30 Thread Sarath Chandra
Hi All, I have a requirement to process a set of files in parallel, so I'm submitting Spark jobs using Java's ExecutorService. But when I do it this way, one or more jobs fail with status EXITED. Earlier I tried with a standalone Spark cluster, setting the job scheduling to Fair Scheduling. I

Re: Ack RabbitMQ messages after processing through Spark Streaming

2014-09-30 Thread khaledh
As a follow-up to my own question, I see that the FlumeBatchFetcher https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeBatchFetcher.scala acks the batch only after it calls Receiver.store(...). So my question is: does store()

Re: S3 - Extra $_folder$ files for every directory node

2014-09-30 Thread pouryas
I would also like to know a way to avoid adding those $_folder$ files to S3. I can go ahead and delete them, but it would be nice if Spark handled this for you.

Re: S3 - Extra $_folder$ files for every directory node

2014-09-30 Thread Nicholas Chammas
Those files are created by the Hadoop API that Spark leverages. Spark does not directly control that. You may be able to check with the Hadoop project on whether they are looking at changing this behavior. I believe it was introduced because S3 at one point required it, though it doesn't anymore.

Re: Window comparison matching using the sliding window functionality: feasibility

2014-09-30 Thread nitinkak001
Any ideas, guys? I've been trying to find some information online, without much luck so far.

registering Array of CompactBuffer to Kryo

2014-09-30 Thread Andras Barjak
Hi, what is the correct Scala code to register an Array of this private Spark class with Kryo? java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.util.collection.CompactBuffer[] Note: To register this class use:

Re: Window comparison matching using the sliding window functionality: feasibility

2014-09-30 Thread Jimmy McErlain
Not sure if this is what you are after, but it's based on a moving average within Spark... I was building an ARIMA model on top of Spark and this helped me out a lot: http://stackoverflow.com/questions/23402303/apache-spark-moving-average

Re: Spark SQL and Hive tables

2014-09-30 Thread Chen Song
I have run into the same issue. I understand that with the new assembly built with -Phive, I can run a Spark job in yarn-cluster mode. But is there a way for me to run spark-shell with Hive support? I tried to add the new assembly jar with --driver-library-path and --driver-class-path but neither

Re: How to run kmeans after pca?

2014-09-30 Thread st553
Thanks for your response Burak, it was very helpful. I am noticing that if I run PCA before KMeans, the KMeans algorithm actually takes longer to run than if I had just run KMeans without PCA. I was hoping that by using PCA first it would actually speed up the KMeans algorithm. I have

Installation question

2014-09-30 Thread mohan
Sorry to ask another basic question. Could you point out what I should read to set up a pseudo-distributed Hadoop, Mahout and Spark cluster? Does it really need something like CDH? I want to access Mahout and Spark output and display it in Play (outside CDH). I also want to access Spark output from

Re: How to run kmeans after pca?

2014-09-30 Thread Evan R. Sparks
Caching after doing the multiply is a good idea. Keep in mind that during the first iteration of KMeans, the cached rows haven't yet been materialized - so it is both doing the multiply and the first pass of KMeans all at once. To isolate which part is slow you can run cachedRows.numRows() to
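A sketch of that sequence against the MLlib 1.x API (the input RDD `data` and the parameter values are placeholders, not from the thread):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// data: RDD[Vector] of the original high-dimensional rows (assumed to exist).
val mat = new RowMatrix(data)

// Project onto the top principal components.
val pc = mat.computePrincipalComponents(20)
val projected = mat.multiply(pc)

// Cache the projected rows and force materialization with a cheap action,
// so the multiply is not silently charged to the first KMeans iteration.
val cachedRows = projected.rows.cache()
cachedRows.count()

val model = KMeans.train(cachedRows, 10, 20)  // numClusters = 10, maxIterations = 20
```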

timestamp not implemented yet

2014-09-30 Thread tonsat
We have installed Spark 1.1 in standalone master mode. When we try to access a Parquet-format table, we get the error below; one of the fields is defined as timestamp. Based on information provided on apache.spark.org, Spark supports Parquet and pretty much all Hive datatypes, including

Re: Schema change on Spark Hive (Parquet file format) table not working

2014-09-30 Thread barge.nilesh
code snippet in short: hiveContext.sql(CREATE EXTERNAL TABLE IF NOT EXISTS people_table (name String, age INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat');

Re: Spark run slow after unexpected repartition

2014-09-30 Thread matthes
I have the same problem! I start the same job 3 or 4 times; it depends on how big the data and the cluster are. The runtime goes down in the following jobs, and at the end I get the fetch failure error, at which point I must restart the spark shell and then everything works well again. And I don't

Re: timestamp not implemented yet

2014-09-30 Thread barge.nilesh
Spark 1.1 comes with Hive 0.12, and Hive 0.12 doesn't support the timestamp datatype for the Parquet format. https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Limitations

[no subject]

2014-09-30 Thread PENG ZANG
Hi, We have a cluster set up with Spark 1.0.2 running 4 workers and 1 master, with 64G RAM each. In the SparkContext we specify 32G executor memory. However, as long as a task runs longer than approximately 15 minutes, all the executors are lost, just like some sort of timeout, no matter if the

Re: timestamp not implemented yet

2014-09-30 Thread tonsat
Thank you Nilesh. The table was originally created in Impala, and I am trying to access the same table through spark-sql. Any idea what the best file format is for running Spark jobs if we have a date field?

MLLib ALS question

2014-09-30 Thread Alex T
Hi, I'm trying to use Matrix Factorization over a dataset with about 6.5M users, 2.5M products and 120M ratings over products. The test is done in standalone mode, with a single worker (quad-core and 16 GB RAM). The program runs out of memory, and I think that this happens because flatMap holds

MLLib: Missing value imputation

2014-09-30 Thread Sameer Tilak
Hi All, Can someone please point me to the documentation that describes how missing value imputation is done in MLlib? Also, any information on how this fits into the overall roadmap would be great.

pyspark cassandra examples

2014-09-30 Thread David Vincelli
I've been trying to get the cassandra_inputformat.py and cassandra_outputformat.py examples running for the past half day. I am running cassandra21 community from datastax on a single node (in my dev environment) with spark-1.1.0-bin-hadoop2.4. I can connect and use cassandra via cqlsh and I can

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-30 Thread Yin Huai
I think this problem has been fixed after the 1.1 release. Can you try the master branch? On Mon, Sep 29, 2014 at 10:06 PM, vdiwakar.malladi vdiwakar.mall...@gmail.com wrote: I'm using the latest version i.e. Spark 1.1.0 Thanks.

Re: MLLib ALS question

2014-09-30 Thread Xiangrui Meng
You may need a cluster with more memory. The current ALS implementation constructs all subproblems in memory. With rank=10, that means (6.5M + 2.5M) * 10^2 / 2 * 8 bytes = 3.5GB. The ratings need 2GB, not counting the overhead. ALS creates in/out blocks to optimize the computation, which takes
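For reference, the back-of-the-envelope arithmetic behind that estimate (a sketch, with rank fixed at 10):

```scala
val numUsers    = 6.5e6
val numProducts = 2.5e6
val rank        = 10
// Roughly rank^2 / 2 doubles (8 bytes each) per user/product subproblem.
val subproblemBytes = (numUsers + numProducts) * rank * rank / 2 * 8
println(f"${subproblemBytes / 1e9}%.1f GB")  // ~3.6 GB, in line with the ~3.5 GB above
```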

Re: MLLib: Missing value imputation

2014-09-30 Thread Xiangrui Meng
We don't handle missing value imputation in the current version of MLlib. In future releases, we could store feature information in the dataset metadata, which might include the default value to use in place of missing values. But no one has committed to working on this feature yet. For now, you can filter out examples
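A minimal sketch of that filtering workaround, assuming missing values are encoded as NaN in LabeledPoint labels or features:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Drop any example whose label or features contain a NaN before training.
def dropIncomplete(data: RDD[LabeledPoint]): RDD[LabeledPoint] =
  data.filter(p => !p.label.isNaN && !p.features.toArray.exists(_.isNaN))
```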

how to get actual count as a long from JavaDStream?

2014-09-30 Thread Andy Davidson
Hi, I have a simple streaming app. All I want to do is figure out how many lines I have received in the current mini-batch. If numLines were a JavaRDD I could simply call count(). How do you do something similar in Streaming? Here is my pseudo code: JavaDStream<String> msg =

Re: how to get actual count as a long from JavaDStream?

2014-09-30 Thread Jon Gregg
Hi Andy, I'm new to Spark and have been working with Scala rather than Java, but I see there's a dstream() method to convert from JavaDStream to DStream. Then within DStream http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/api/java/org/apache/spark/streaming/dstream/DStream.html there is a
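A minimal Scala sketch of the pattern the thread converges on (the Java API exposes the same count() and foreachRDD() methods on JavaDStream): count() yields a DStream with one element per mini-batch, and foreachRDD runs a function on the driver for each batch RDD.

```scala
// msg is the DStream[String] of received lines (assumed to exist).
val counts = msg.count()  // DStream[Long]: one count per mini-batch

counts.foreachRDD { rdd =>
  // Each batch RDD holds a single Long; pull it back to the driver and log it.
  val n = rdd.take(1).headOption.getOrElse(0L)
  println(s"lines received in this batch: $n")
}
```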

Re: Short Circuit Local Reads

2014-09-30 Thread Andrew Ash
Hi Gary, I gave this a shot on a CDH 4.7 test cluster and actually saw a regression in performance when running the numbers. Have you done any benchmarking? Below are my numbers. Experimental method: 1. Write 14GB of data to HDFS via [1] 2. Read data multiple times via [2] Experiment 1:

processing large number of files

2014-09-30 Thread SK
Hi, I am trying to compute the number of unique users from a year's worth of data. There are about 300 files and each file is quite large (on the order of a GB). I first tried this without a loop by reading all the files in the directory using the glob pattern sc.textFile(dir/*), but the tasks were

Re: shuffle memory requirements

2014-09-30 Thread Andrew Ash
Hi Maddenpj, Right now the best estimate I've heard for the open file limit is that you'll need the square of the largest partition count in your dataset. I filed a ticket to log the ulimit value when it's too low at https://issues.apache.org/jira/browse/SPARK-3750 On Mon, Sep 29, 2014 at 6:20

Re: pyspark cassandra examples

2014-09-30 Thread Kan Zhang
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected Most likely it is the Hadoop 1 vs Hadoop 2 issue. The example was given for Hadoop 1 (default Hadoop version for Spark). You may try to set the output format class in conf for

Handling tree reduction algorithm with Spark in parallel

2014-09-30 Thread Boromir Widas
Hello Folks, I have been trying to implement a tree reduction algorithm recently in Spark but could not find suitable parallel operations. Assuming I have a general tree like the following, I have to do the following: 1) Do some computation at each leaf node to get an array of doubles. (This

Re: Short Circuit Local Reads

2014-09-30 Thread Kay Ousterhout
Hi Andrew and Gary, I've done some experimentation with this and had similar results. I can't explain the speedup in write performance, but I dug into the read slowdown and found that enabling short-circuit reads results in Hadoop not doing read-ahead in the same way. At a high level, when SCR

Re: pyspark cassandra examples

2014-09-30 Thread David Vincelli
Thanks, that worked! I downloaded the version pre-built against hadoop1 and the examples worked. - David On Tue, Sep 30, 2014 at 5:08 PM, Kan Zhang kzh...@apache.org wrote: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

Re: Handling tree reduction algorithm with Spark in parallel

2014-09-30 Thread Andy Twigg
Hi Boromir, Assuming the tree fits in memory, and what you want to do is parallelize the computation, the 'obvious' way is the following: * broadcast the tree T to each worker (ok since it fits in memory) * construct an RDD for the deepest level - each element in the RDD is (parent,data_at_node)
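A hedged sketch of that level-by-level idea, with invented types: a parent map stands in for the broadcast tree, and an elementwise sum stands in for whatever per-node computation is actually needed.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD operations in Spark 1.x
import org.apache.spark.rdd.RDD

// Placeholder for the combining an internal node performs on its children.
def combine(xs: Iterable[Array[Double]]): Array[Double] =
  xs.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

def treeReduce(sc: SparkContext,
               parentOf: Map[Long, Option[Long]],     // broadcastable tree structure
               leaves: RDD[(Long, Array[Double])],    // (nodeId, data) at the deepest level
               depth: Int): RDD[(Long, Array[Double])] = {
  val parents = sc.broadcast(parentOf)
  var level = leaves
  for (_ <- 1 to depth) {
    // Re-key each node by its parent and combine siblings; the root (no parent)
    // keeps its own key and passes through unchanged.
    level = level
      .map { case (id, data) => (parents.value(id).getOrElse(id), data) }
      .groupByKey()
      .mapValues(combine)
  }
  level
}
```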

Re: processing large number of files

2014-09-30 Thread Liquan Pei
You can use sc.wholeTextFiles to read a directory of text files. Also, it seems from your code that you are only interested in the current year's count; you can perform a filter before distinct() and then a reduce to sum up the counts. Hope this helps! Liquan On Tue, Sep 30, 2014 at 1:59 PM, SK
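A sketch of that suggestion; the directory path, tab-separated layout, and column positions are assumptions about the data, not from the thread:

```scala
val files = sc.wholeTextFiles("hdfs:///data/2013")       // RDD[(fileName, fileContent)]
val uniqueUsers = files
  .flatMap { case (_, content) => content.split("\n") }  // one record per line
  .map(_.split("\t"))
  .filter(cols => cols(0).startsWith("2013"))            // filter before distinct()
  .map(cols => cols(1))                                  // assumed user-id column
  .distinct()
  .count()
```

Note that wholeTextFiles loads each file as a single string, so it suits directories of many smallish files; for files approaching a GB each, sc.textFile over the glob may still be the better fit.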

Re: Multiple exceptions in Spark Streaming

2014-09-30 Thread Tathagata Das
Are these the logs of the worker where the failure occurs? I think issues similar to these have since been solved in later versions of Spark. TD On Tue, Sep 30, 2014 at 11:33 AM, Shaikh Riyaz shaikh@gmail.com wrote: Dear All, We are using Spark streaming version 1.0.0 in our Cloudera Hadoop

RDD not getting generated

2014-09-30 Thread vvarma
Hi, I am new to Spark. I have written a custom RabbitMQ receiver which calls the store method of the Receiver interface. I can see that the block is being stored. I am trying to process each RDD in the DStream using the foreach function, but am unable to figure out why this block is not getting invoked

Re: how to get actual count as a long from JavaDStream?

2014-09-30 Thread Andy Davidson
Hi Jon, Thanks, foreachRDD seems to work. I am running on a 4-machine cluster. It seems like the function executed by foreachRDD is running on my driver; I used logging to check. This is exactly what I want. I need to write my final results back to stdout so RDD.pipe() will work. I do not have any

Re: Handling tree reduction algorithm with Spark in parallel

2014-09-30 Thread Debasish Das
If the tree is too big, build it on GraphX, but it will need thorough analysis so that the partitions are well balanced... On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg andy.tw...@gmail.com wrote: Hi Boromir, Assuming the tree fits in memory, and what you want to do is parallelize the

apache spark union function cause executors disassociate (Lost executor 1 on 172.32.1.12: remote Akka client disassociated)

2014-09-30 Thread Edwin
I have a 3-node EC2 cluster, with each node assigned 18G for spark executor memory. After I run my Spark batch job, I get two RDDs from different forks, but with the exact same format. And when I perform the union operation, I get an executor-disassociated error and the whole Spark job fails and quits. Memory shouldn't

memory vs data_size

2014-09-30 Thread anny9699
Hi, Is there any guidance about how much total memory is needed to achieve relatively good speed for data of a certain size? I have about 200 GB of data and the current total memory across my 8 machines is around 120 GB. Is that too small for data this big? Even the read

Re: apache spark union function cause executors disassociate (Lost executor 1 on 172.32.1.12: remote Akka client disassociated)

2014-09-30 Thread Edwin
Does the union function cause any data shuffling?

Re: apache spark union function cause executors disassociate (Lost executor 1 on 172.32.1.12: remote Akka client disassociated)

2014-09-30 Thread Edwin
19:02:45,963 INFO [org.apache.spark.MapOutputTrackerMaster] (spark-akka.actor.default-dispatcher-14) Size of output statuses for shuffle 1 is 216 bytes 19:02:45,964 INFO [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Got job 5 (getCallSite at null:-1) with

Re: Multiple exceptions in Spark Streaming

2014-09-30 Thread Shaikh Riyaz
Hi TD, Thanks for your reply. The attachment in the previous email was from the master. Below is the log message from one of the workers. --- 2014-10-01 01:49:22,348 ERROR akka.remote.EndpointWriter: AssociationError

Re: memory vs data_size

2014-09-30 Thread Liquan Pei
Hi, By default, 60% of JVM memory is reserved for RDD caching, so in your case roughly 72GB is available for RDDs, which means that your total data (about 200 GB) may not fit in memory. You can check the RDD memory statistics via the storage tab in the web UI. Hope this helps! Liquan On Tue, Sep 30, 2014 at 4:11 PM,
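For reference, the 60% figure corresponds to spark.storage.memoryFraction, which defaults to 0.6 in Spark 1.x and can be tuned; a minimal sketch (0.7 is chosen purely as an example):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("memory-vs-data-size")
  .set("spark.storage.memoryFraction", "0.7")  // default is 0.6
```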

Re: memory vs data_size

2014-09-30 Thread Debasish Das
Only fit the data in memory where you want to run the iterative algorithm. For map-reduce operations, it's better not to cache if you have a memory crunch... Also schedule the persist and unpersist calls such that you utilize the RAM well... On Tue, Sep 30, 2014 at 4:34 PM, Liquan Pei

A sample for generating big data - and some design questions

2014-09-30 Thread Steve Lewis
The sample below is essentially word count, modified to be "big data" by turning lines into groups of upper-case letters and then generating all case variants - it is modeled after some real problems in biology. The issue is that I know how to do this in Hadoop, but in Spark the use of a List in my flatMap

Poor performance writing to S3

2014-09-30 Thread Gustavo Arjones
Hi, I’m trying to save about a million lines containing statistics data, something like: 233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404691200 1404691200 1402316275 46 0 0 7 0 0 0

Re: Short Circuit Local Reads

2014-09-30 Thread Andrew Ash
Thanks for the research, Kay! It does seem to be addressed, and hopefully fixed, in that ticket conversation and also in https://issues.apache.org/jira/browse/HDFS-4697. So the best thing here is to wait to upgrade to a version of Hadoop that has that fix and then repeat the test. That will be

Re: Multiple exceptions in Spark Streaming

2014-09-30 Thread Tathagata Das
It would help to turn on debug-level logging in log4j and see the logs. Just looking at the error logs is not giving me any clues. :( TD On Tue, Sep 30, 2014 at 4:30 PM, Shaikh Riyaz shaikh@gmail.com wrote: Hi TD, Thanks for your reply. Attachment in previous email was from Master.

Re: Reading from HBase is too slow

2014-09-30 Thread Ted Yu
Can you launch a job which exercises TableInputFormat on the same table without using Spark? This would show whether the slowdown is in HBase code or somewhere else. Cheers On Mon, Sep 29, 2014 at 11:40 PM, Tao Xiao xiaotao.cs@gmail.com wrote: I checked HBase UI. Well, this table is not

Re: Spark 1.1.0 hbase_inputformat.py not work

2014-09-30 Thread Kan Zhang
I somehow missed this. Do you still have the problem? You probably didn't specify the correct spark-examples jar using --driver-class-path. See the following for an example: MASTER=local ./bin/spark-submit --driver-class-path ./examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar

How to get SparkContext inside mapPartitions?

2014-09-30 Thread Henry Hung
Hi All, A noob question: how to get SparkContext inside mapPartitions? Example: Let's say I have rddObjects that can be split into different partitions to be assigned to multiple executors, to speed up exporting data from the database. The variable sc is created in the main program using these

IPython Notebook Debug Spam

2014-09-30 Thread Rick Richardson
I am experiencing significant logging spam when running PySpark in IPython Notebook. Exhibit A: http://i.imgur.com/BDP0R2U.png I have taken into consideration the advice from: http://apache-spark-user-list.1001560.n3.nabble.com/Disable-all-spark-logging-td1960.html also

How to read just specified columns from parquet file using SparkSQL.

2014-09-30 Thread mykidong
Hi, I am new to SparkSQL. I want to read only specified columns from a Parquet file, not all the columns defined in the Parquet file. For instance, the schema of the Parquet file would look like this: { type: record, name: ElectricPowerUsage, namespace: jcascalog.parquet.example, fields: [
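A minimal sketch against the Spark 1.1 SQL API of how a column subset can be read; the path and column names below are hypothetical, since the full schema is not shown. Selecting only some columns lets the Parquet reader prune the others, so they are never read from disk:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val usage = sqlContext.parquetFile("hdfs:///data/electric_power_usage.parquet")  // hypothetical path
usage.registerTempTable("power_usage")

// Only the referenced columns (hypothetical names) are materialized.
val projected = sqlContext.sql("SELECT household_id, kwh FROM power_usage")
projected.collect().foreach(println)
```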