Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-07 Thread Sean Owen
That's a Hive version issue, not a Hadoop version issue. On Tue, Oct 7, 2014 at 7:21 AM, Li HM hmx...@gmail.com wrote: Thanks for the reply. Please refer to my other post entitled How to make ./bin/spark-sql work with hive. It has all the errors/exceptions I am getting. If I understand you

HiveServer1 and SparkSQL

2014-10-07 Thread deenar.toraskar
Hi, Shark supported both the HiveServer1 and HiveServer2 thrift interfaces (using $ bin/shark -service sharkserver[1 or 2]). SparkSQL seems to support only HiveServer2. I was wondering what is involved in adding support for HiveServer1. Is this something straightforward to do that I can embark on

Re: Spark Streaming saveAsNewAPIHadoopFiles

2014-10-07 Thread Abraham Jacob
Hi All, Continuing on this discussion... Is there a good reason why the def of saveAsNewAPIHadoopFiles in org/apache/spark/streaming/api/java/JavaPairDStream.scala is defined like this - def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_],
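For readers following the thread, a minimal Scala sketch of how that overload is typically called (the socket source, output path and batch interval below are assumed, not taken from the discussion):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    val conf = new SparkConf().setAppName("SaveAsNewAPIHadoopFilesSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Word counts kept as (String, Int) through the shuffle, converted to
    // Writables only for the save step.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .map { case (w, c) => (new Text(w), new IntWritable(c)) }

    // The overload under discussion: explicit key, value and output-format classes.
    counts.saveAsNewAPIHadoopFiles(
      "hdfs:///tmp/wordcounts", "txt",
      classOf[Text], classOf[IntWritable],
      classOf[TextOutputFormat[Text, IntWritable]])

    ssc.start()
    ssc.awaitTermination()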

Re: Strategies for reading large numbers of files

2014-10-07 Thread deenar.toraskar
Hi Landon, I had a problem very similar to yours, where we had to process around 5 million relatively small files on NFS. After trying various options, we did something similar to what Matei suggested. 1) take the original path, find the subdirectories under that path, and then parallelize the
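To make the approach concrete, here is a rough Scala sketch of the idea (the NFS path and the per-file work are invented placeholders, not the poster's code):

    import java.io.File
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("ManySmallFilesSketch"))

    // 1) List the subdirectories once, on the driver (hypothetical mount point).
    val root = new File("/mnt/nfs/data")
    val subDirs = root.listFiles().filter(_.isDirectory).map(_.getAbsolutePath)

    // 2) Parallelize the directory list so each task walks its own directory.
    val results = sc.parallelize(subDirs, subDirs.length).flatMap { dir =>
      new File(dir).listFiles().filter(_.isFile).map { f =>
        val src = scala.io.Source.fromFile(f)
        val length = try src.mkString.length finally src.close()
        (f.getName, length)   // placeholder processing: just record the size
      }
    }
    results.count()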

Parsing one big multiple line .xml loaded in RDD using Python

2014-10-07 Thread jan.zikes
Hi, I have already unsuccessfully asked a quite similar question at Stack Overflow, particularly here: http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim. I've also tried some workarounds, but unsuccessfully; the workaround problem can be

Re: Relation between worker memory and executor memory in standalone mode

2014-10-07 Thread MEETHU MATHEW
Try setting --total-executor-cores to limit how many total cores it can use. Thanks & Regards, Meethu M On Thursday, 2 October 2014 2:39 AM, Akshat Aranya aara...@gmail.com wrote: I guess one way to do so would be to run 1 worker per node, like say, instead of running 1 worker and giving
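For reference, --total-executor-cores on spark-submit maps to the spark.cores.max setting; a minimal sketch with an illustrative value:

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap the total cores this application may claim on a standalone cluster;
    // "8" and the app name are placeholders.
    val conf = new SparkConf()
      .setAppName("CoreLimitSketch")
      .set("spark.cores.max", "8")   // same effect as --total-executor-cores 8
    val sc = new SparkContext(conf)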

Re: Hive Parquet Serde from Spark

2014-10-07 Thread quintona
I have found related PRs in the parquet-mr project: https://github.com/Parquet/parquet-mr/issues/324; however, using that version of the bundle doesn't solve the issue. The issue seems to be related to package scope in separate class loaders. I am busy looking for a workaround. -- View this

Cannot read from s3 using sc.textFile

2014-10-07 Thread Tomer Benyamini
Hello, I'm trying to read from s3 using a simple Spark Java app: - SparkConf sparkConf = new SparkConf().setAppName("TestApp"); sparkConf.setMaster("local"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");

Fwd: Cannot read from s3 using sc.textFile

2014-10-07 Thread Tomer Benyamini
Hello, I'm trying to read from s3 using a simple Spark Java app: - SparkConf sparkConf = new SparkConf().setAppName("TestApp"); sparkConf.setMaster("local"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");

Same code --works in spark 1.0.2-- but not in spark 1.1.0

2014-10-07 Thread MEETHU MATHEW
Hi all, My code was working fine in Spark 1.0.2, but after upgrading to 1.1.0, it is throwing exceptions and tasks are getting failed. The code contains some map and filter transformations followed by groupByKey (reduceByKey in another code). What I could find out is that the code works fine

Re: GraphX: Types for the Nodes and Edges

2014-10-07 Thread Oshi
Hi again, Thank you for your suggestion :) I've tried to implement this method but I'm stuck trying to union the payloads before creating the graph. Below is a really simplified snippet of what has worked so far. //Reading the articles given in json format val articles =
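One common way to union vertex payloads of different types before building the graph is to wrap them in a shared trait; a hedged sketch (the case classes, fields and sample data are invented, and sc is an existing SparkContext):

    import org.apache.spark.graphx.{Edge, Graph, VertexId}
    import org.apache.spark.rdd.RDD

    // A common payload type so heterogeneous vertex RDDs can be unioned.
    sealed trait Payload extends Serializable
    case class Article(title: String) extends Payload
    case class Author(name: String) extends Payload

    val articleVertices: RDD[(VertexId, Payload)] =
      sc.parallelize(Seq((1L, Article("GraphX overview")), (2L, Article("Pregel"))))
    val authorVertices: RDD[(VertexId, Payload)] =
      sc.parallelize(Seq((100L, Author("Alice"))))

    // Union the payloads first, then build the graph.
    val vertices = articleVertices.union(authorVertices)
    val edges: RDD[Edge[String]] =
      sc.parallelize(Seq(Edge(100L, 1L, "wrote"), Edge(100L, 2L, "wrote")))

    val graph: Graph[Payload, String] = Graph(vertices, edges)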

Re: Unable to ship external Python libraries in PYSPARK

2014-10-07 Thread yh18190
Hi David, Thanks for the reply and the effort you put into explaining the concepts. Thanks for the example; it worked. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p15844.html Sent from the Apache Spark User

Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-07 Thread Cheng Lian
The build command should be correct. What exact error did you encounter when trying Spark 1.1 + Hive 0.12 + Hadoop 2.5.0? On 10/7/14 2:21 PM, Li HM wrote: Thanks for the reply. Please refer to my other post entitled How to make ./bin/spark-sql work with hive. It has all the

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread TANG Gen
Hi, the same problem happens when I try several joins together, such as 'SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)'. The error information is as follows:

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Gen
Hi, in fact, the same problem happens when I try several joins together: SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY) py4j.protocol.Py4JJavaError: An error occurred while

akka.remote.transport.netty.NettyTransport

2014-10-07 Thread Jacob Chacko - Catalyst Consulting
Hi All, I have one master and one worker on AWS (Amazon Web Services) and am running the Spark MapReduce code provided at https://spark.apache.org/examples.html. We are using Spark version 1.0.2. Word Count: val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" "))
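For context, the word count on the linked examples page continues roughly as follows (standard example; the paths are elided there as well):

    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")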

Re: Parsing one big multiple line .xml loaded in RDD using Python

2014-10-07 Thread Davies Liu
Maybe sc.wholeTextFiles() is what you want; you can get the whole text and parse it yourself. On Tue, Oct 7, 2014 at 1:06 AM, jan.zi...@centrum.cz wrote: Hi, I have already unsuccessfully asked a quite similar question at Stack Overflow, particularly here:
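A short sketch of that suggestion, shown here in Scala (the Python API is analogous) with an assumed input path and placeholder parsing:

    // wholeTextFiles returns (filename, fullContents) pairs, so each XML document
    // stays intact and can be handed to a parser as one string.
    val docs = sc.wholeTextFiles("hdfs:///data/wiki-dump/")
    val titles = docs.map { case (path, xml) =>
      // placeholder parsing: pull out a <title> element if present
      val title = "<title>(.*?)</title>".r.findFirstMatchIn(xml).map(_.group(1))
      (path, title.getOrElse(""))
    }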

Re: Cannot read from s3 using sc.textFile

2014-10-07 Thread Sunny Khatri
Not sure if it's supposed to work. Can you try newAPIHadoopFile(), passing in the required configuration object? On Tue, Oct 7, 2014 at 4:20 AM, Tomer Benyamini tomer@gmail.com wrote: Hello, I'm trying to read from s3 using a simple Spark Java app: - SparkConf

Spark Streaming Fault Tolerance (?)

2014-10-07 Thread Massimiliano Tomassi
Reading the Spark Streaming Programming Guide I found a couple of interesting points. First of all, while talking about receivers, it says: "If the number of cores allocated to the application is less than or equal to the number of input DStreams / receivers, then the system will receive data,

Re: Kafka-HDFS to store as Parquet format

2014-10-07 Thread Soumitra Kumar
Currently I am not doing anything; if anything changes, I start from scratch. In general I doubt there are many options to account for schema changes. If you are reading the files using Impala, then it may allow this if the schema changes are append-only. Otherwise existing Parquet files have to be migrated

Re: Kafka-HDFS to store as Parquet format

2014-10-07 Thread Buntu Dev
Thanks for the input, Soumitra. On Tue, Oct 7, 2014 at 10:24 AM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Currently I am not doing anything; if anything changes, I start from scratch. In general I doubt there are many options to account for schema changes. If you are reading files using

Re: Cannot read from s3 using sc.textFile

2014-10-07 Thread Daniil Osipov
Try using s3n:// instead of s3 (for the credential configuration as well). On Tue, Oct 7, 2014 at 9:51 AM, Sunny Khatri sunny.k...@gmail.com wrote: Not sure if it's supposed to work. Can you try newAPIHadoopFile() passing in the required configuration object. On Tue, Oct 7, 2014 at 4:20 AM,
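Putting the suggestion together, a hedged Scala version of the original snippet using s3n:// and the matching credential keys (bucket and keys are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("S3ReadSketch").setMaster("local"))

    // s3n:// is configured through the fs.s3n.* keys rather than fs.s3.*
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    val lines = sc.textFile("s3n://your-bucket/path/to/data")
    println(lines.count())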

Re: return probability \ confidence instead of actual class

2014-10-07 Thread Sunny Khatri
Not familiar with the scikit SVM implementation (and I assume you are using LinearSVC). To figure out an optimal decision boundary based on the scores obtained, you can use an ROC curve, varying your thresholds. On Tue, Oct 7, 2014 at 12:08 AM, Adamantios Corais adamantios.cor...@gmail.com wrote:
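In MLlib, for instance, raw SVM scores can be obtained by clearing the model's threshold and then fed to BinaryClassificationMetrics to trace an ROC curve; a rough sketch, assuming training and test are RDD[LabeledPoint]s prepared elsewhere:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.rdd.RDD

    val model = SVMWithSGD.train(training, numIterations = 100)
    model.clearThreshold()   // predict() now returns raw margins instead of 0/1 labels

    val scoreAndLabel: RDD[(Double, Double)] =
      test.map(p => (model.predict(p.features), p.label))

    val metrics = new BinaryClassificationMetrics(scoreAndLabel)
    println(s"Area under ROC = ${metrics.areaUnderROC()}")
    metrics.roc().collect().foreach { case (fpr, tpr) => println(s"$fpr $tpr") }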

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Matei Zaharia
The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.
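A minimal sketch of that switch, reusing the table names from the thread (sc is an existing SparkContext):

    import org.apache.spark.sql.hive.HiveContext

    // HiveContext parses HiveQL, which accepts the multi-way join that the plain
    // SQLContext parser in 1.1 rejects.
    val hiveContext = new HiveContext(sc)
    val result = hiveContext.sql("""
      SELECT *
      FROM sales
      INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY
      INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY
                          AND magasin.FORM_KEY = eans.FORM_KEY)
    """)
    result.take(10).foreach(println)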

Re: Stupid Spark question

2014-10-07 Thread Sean Owen
You can create a new Configuration object in something like a mapPartitions method and use that. It will pick up local Hadoop configuration from the node, but presumably the Spark workers and HDFS data nodes are colocated in this case, so the machines have the correct Hadoop config locally. On
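A hedged sketch of that pattern (the per-partition work, looking up file sizes on HDFS, is invented for illustration; rdd is an assumed RDD[String] of paths):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val sizes = rdd.mapPartitions { paths =>
      // Created on the worker, so it picks up that node's local Hadoop configuration.
      val conf = new Configuration()
      val fs = FileSystem.get(conf)
      paths.map(p => (p, fs.getFileStatus(new Path(p)).getLen))
    }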

Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-07 Thread Li HM
Thanks Cheng. Here is the error message after a fresh build. $ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -Phive -DskipTests clean package [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM

MLLib Linear regression

2014-10-07 Thread Sameer Tilak
Hi All, I have the following classes of features: class A: 15000 features, class B: 170 features, class C: 900 features, class D: 6000 features. I use linear regression (over sparse data). I get excellent results with low RMSE (~0.06) for the following combinations of classes: 1. A + B + C 2. B + C + D 3.

RE: MLLib Linear regression

2014-10-07 Thread Sameer Tilak
BTW, one detail: when the number of iterations is 100, all weights are zero or below and the indices are only from set A. When the number of iterations is 150, I see 30+ non-zero weights (when sorted by weight) and the indices are distributed across all sets. However, MSE is high (5.xxx) and the result does

Re: Shuffle files

2014-10-07 Thread SK
- We set ulimit to 50, but I still get the same "too many open files" warning. - I tried setting consolidateFiles to true, but that did not help either. I am using a Mesos cluster. Does Mesos have any limit on the number of open files? thanks -- View this message in context:

anyone else seeing something like https://issues.apache.org/jira/browse/SPARK-3637

2014-10-07 Thread Steve Lewis
java.lang.NullPointerException at java.nio.ByteBuffer.wrap(ByteBuffer.java:392) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at

Re: anyone else seeing something like https://issues.apache.org/jira/browse/SPARK-3637

2014-10-07 Thread Andrew Or
Hi Steve, what Spark version are you running? 2014-10-07 14:45 GMT-07:00 Steve Lewis lordjoe2...@gmail.com: java.lang.NullPointerException at java.nio.ByteBuffer.wrap(ByteBuffer.java:392) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at

Re: MLLib Linear regression

2014-10-07 Thread Xiangrui Meng
Did you test different regularization parameters and step sizes? In the combination that works, I don't see A + D. Did you test that combination? Is there any linear dependency between A's columns and D's columns? -Xiangrui On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak ssti...@live.com wrote:
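As an illustration of such a sweep, a small grid over step size and regularization with one of the SGD-based regressors (values are arbitrary; training is an assumed RDD[LabeledPoint]):

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.regression.LassoWithSGD

    val numIterations = 100
    for (stepSize <- Seq(0.01, 0.1, 1.0); regParam <- Seq(0.0, 0.01, 0.1)) {
      val model = LassoWithSGD.train(training, numIterations, stepSize, regParam)
      val mse = training.map { p =>
        val err = model.predict(p.features) - p.label
        err * err
      }.mean()
      println(s"stepSize=$stepSize regParam=$regParam trainMSE=$mse")
    }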

Spark / Kafka connector - CDH5 distribution

2014-10-07 Thread Abraham Jacob
Hi All, Does anyone know if CDH5.1.2 packages the spark streaming kafka connector under the spark externals project? -- ~

Storing shuffle files on a Tachyon

2014-10-07 Thread Soumya Simanta
Is it possible to store Spark shuffle files on Tachyon?

SparkStreaming program does not start

2014-10-07 Thread spr
I'm probably doing something obviously wrong, but I'm not seeing it. I have the program below (in a file try1.scala), which is similar but not identical to the examples. import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._

Re: Spark / Kafka connector - CDH5 distribution

2014-10-07 Thread Abraham Jacob
Thanks Sean, Sorry, in my earlier question I meant to type CDH5.1.3, not CDH5.1.2. I presume it's included in spark-streaming_2.10-1.0.0-cdh5.1.3, but for some reason Eclipse complains that import org.apache.spark.streaming.kafka cannot be resolved, even though I have included the

RE: Shuffle files

2014-10-07 Thread Lisonbee, Todd
Are you sure the new ulimit has taken effect? How many cores are you using? How many reducers? In general, if a node in your cluster has C assigned cores and you run a job with X reducers, then Spark will open C*X files in parallel and start writing. Shuffle
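To put illustrative numbers on that: with C = 8 assigned cores and X = 1,000 reducers, a node would have on the order of 8 * 1,000 = 8,000 shuffle files open at once while writing, which already dwarfs a typical default ulimit of 1,024.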

Re: SparkStreaming program does not start

2014-10-07 Thread Abraham Jacob
Try using spark-submit instead of spark-shell On Tue, Oct 7, 2014 at 3:47 PM, spr s...@yarcdata.com wrote: I'm probably doing something obviously wrong, but I'm not seeing it. I have the program below (in a file try1.scala), which is similar but not identical to the examples. import

Re: Shuffle files

2014-10-07 Thread Sunny Khatri
@SK: Make sure ulimit has taken effect as Todd mentioned. You can verify via ulimit -a. Also make sure you have proper kernel parameters set in /etc/sysctl.conf (MacOSX) On Tue, Oct 7, 2014 at 3:57 PM, Lisonbee, Todd todd.lison...@intel.com wrote: Are you sure the new ulimit has taken effect?

Re: SparkStreaming program does not start

2014-10-07 Thread spr
|| Try using spark-submit instead of spark-shell Two questions: - What does spark-submit do differently from spark-shell that makes you think that may be the cause of my difficulty? - When I try spark-submit it complains about Error: Cannot load main class from JAR:
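The Cannot load main class error generally means spark-submit expects a packaged application with a main method rather than a script pasted into the shell; a minimal skeleton under that assumption (object name, source and jar name are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object Try1 {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("Try1")
        val ssc = new StreamingContext(conf, Seconds(5))
        val lines = ssc.socketTextStream("localhost", 9999)   // assumed source
        lines.count().print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

    // Built into a jar and launched along the lines of:
    //   spark-submit --class Try1 --master local[2] try1.jar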

Re: Spark / Kafka connector - CDH5 distribution

2014-10-07 Thread Abraham Jacob
Never mind... my bad... made a typo. looks good. Thanks, On Tue, Oct 7, 2014 at 3:57 PM, Abraham Jacob abe.jac...@gmail.com wrote: Thanks Sean, Sorry in my earlier question I meant to type CDH5.1.3 not CDH5.1.2 I presume it's included in spark-streaming_2.10-1.0.0-cdh5.1.3 But for some

bug with IPython notebook?

2014-10-07 Thread Andy Davidson
Hi, I think I found a bug in the IPython notebook integration. I am not sure how to report it. I am running spark-1.1.0-bin-hadoop2.4 on an AWS EC2 cluster. I start the cluster using the launch script provided by Spark. I start the IPython notebook on my cluster master as follows and use an SSH tunnel

spark fold question

2014-10-07 Thread chinchu
Hi, I am using fold(zeroValue)(t1, t2) on the RDD. I noticed that it runs in parallel on all the partitions and then aggregates the results from the partitions. My data object is not aggregate-able, so I was wondering if there's any way to run the fold sequentially. [I am looking to do a foldLeft
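One way to get a strictly sequential fold is to stream partitions back to the driver and fold there; a hedged sketch (assumes the data can pass through the driver one partition at a time):

    import org.apache.spark.rdd.RDD

    // toLocalIterator fetches one partition at a time to the driver, so folding over
    // it runs sequentially in partition order instead of in parallel on the executors.
    def sequentialFold[T, A](rdd: RDD[T], zero: A)(op: (A, T) => A): A =
      rdd.toLocalIterator.foldLeft(zero)(op)

    // Illustrative use: val sum = sequentialFold(sc.parallelize(1 to 10), 0L)(_ + _)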

Spark-Shell: OOM: GC overhead limit exceeded

2014-10-07 Thread sranga
Hi I am new to Spark and trying to develop an application that loads data from Hive. Here is my setup: * Spark-1.1.0 (built using -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive) * Executing Spark-shell on a box with 16 GB RAM * 4 Cores Single Processor * OpenCSV library (SerDe) * Hive table

Re: Record-at-a-time model for Spark Streaming

2014-10-07 Thread Tobias Pfeiffer
Jianneng, On Wed, Oct 8, 2014 at 8:44 AM, Jianneng Li jiannen...@berkeley.edu wrote: I understand that Spark Streaming uses micro-batches to implement streaming, while traditional streaming systems use the record-at-a-time processing model. The performance benefit of the former is throughput,

Re: dynamic sliding window duration

2014-10-07 Thread Tobias Pfeiffer
Hi, On Wed, Oct 8, 2014 at 4:50 AM, Josh J joshjd...@gmail.com wrote: I have a source which fluctuates in the frequency of streaming tuples. I would like to process certain batch counts, rather than batch window durations. Is it possible to either 1) define batch window sizes Cf.

Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-07 Thread Li HM
Here is the hive-site.xml: <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <!-- Hive Execution Parameters --> <property> <name>hive.metastore.local</name> <value>false</value> <description>controls whether to connect to remove metastore server or open a new

Re: Reading from HBase is too slow

2014-10-07 Thread Tao Xiao
I found the reason why reading HBase is too slow. Although each regionserver serves multiple regions for the table I'm reading, the number of Spark workers allocated by Yarn is too low. Actually, I could see that the table has dozens of regions spread over about 20 regionservers, but only two

Re: Shuffle files

2014-10-07 Thread Andrew Ash
You will need to restart your Mesos workers to pick up the new limits as well. On Tue, Oct 7, 2014 at 4:02 PM, Sunny Khatri sunny.k...@gmail.com wrote: @SK: Make sure ulimit has taken effect as Todd mentioned. You can verify via ulimit -a. Also make sure you have proper kernel parameters set

Re: Same code --works in spark 1.0.2-- but not in spark 1.1.0

2014-10-07 Thread Andrew Ash
Hi Meethu, I believe you may be hitting a regression in https://issues.apache.org/jira/browse/SPARK-3633 If you are able, could you please try running a patched version of Spark 1.1.0 that has commit 4fde28c reverted and see if the errors go away? Posting your results on that bug would be

Support for Parquet V2 in ParquetTableSupport?

2014-10-07 Thread Michael Allman
Hello, I was interested in testing Parquet V2 with Spark SQL, but noticed after some investigation that the parquet writer that Spark SQL uses is fixed at V1 here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L350.