problem with data locality api

2014-09-28 Thread qinwei
Hi, everyone, I come across a problem about data locality. I found this example code in 《Spark-on-YARN-A-Deep-Dive-Sandy-Ryza.pdf》: val locData = InputFormatInfo.computePreferredLocations(Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))

problem with partitioning

2014-09-28 Thread qinwei
Hi, everyone    I come across a problem with changing the partition number of the RDD. My code is as below:    val rdd1 = sc.textFile(path1)     val rdd2 = sc.textFile(path2)     val rdd3 = sc.textFile(path3)     val imeiList = parseParam(job.jobParams)     val broadcastVar =

Re: How to use multi thread in RDD map function ?

2014-09-28 Thread myasuka
Thank you for your reply. Actually, we have already used this parameter. Our cluster is a standalone cluster with 16 nodes, and every node has 16 cores. We have 256 pairs of matrices along with 256 tasks; when we set --total-executor-cores to 64, each node can launch 4 tasks simultaneously, each

RE: problem with data locality api

2014-09-28 Thread Shao, Saisai
Hi, The first conf is used by Hadoop to determine the locality distribution of the HDFS file. The second conf is used by Spark; though they have the same name, they are actually two different classes. Thanks, Jerry From: qinwei [mailto:wei@dewmobile.net] Sent: Sunday, September 28, 2014 2:05 PM To: user

Re: How to use multi thread in RDD map function ?

2014-09-28 Thread Sean Owen
If increasing executors really isn't enough, then you can consider using mapPartitions to process whole partitions at a time. Within that you can multi-thread your processing of the elements in the partition. (And you should probably use more like one worker per machine then.) The question is how
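
A minimal sketch of the mapPartitions approach described above. The names pairsRDD and multiply() are hypothetical stand-ins for the poster's matrix-pair RDD and per-pair computation, and the pool size of 4 is illustrative, not taken from the thread:

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    // pairsRDD: RDD of matrix pairs; multiply(pair) computes one product (both hypothetical)
    val products = pairsRDD.mapPartitions { iter =>
      // One small thread pool per partition, i.e. per task
      val pool = Executors.newFixedThreadPool(4)
      implicit val ec = ExecutionContext.fromExecutorService(pool)
      // Launch the per-element work concurrently, then wait for all of it
      val futures = iter.map(pair => Future { multiply(pair) }).toList
      val results = futures.map(f => Await.result(f, Duration.Inf))
      pool.shutdown()
      results.iterator
    }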

Re: RE: problem with data locality api

2014-09-28 Thread qinwei
Thank you for your reply. I understand your explanation, but I wonder what is the correct usage of the API new SparkContext(config: SparkConf, preferredNodeLocationData: Map[String, Set[SplitInfo]]). How do I construct the second param preferredNodeLocationData? Hope for
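
A minimal sketch of how the two pieces might fit together, following the pattern in the slide quoted in the first message of this thread (the file path and app name are hypothetical, and this constructor is a developer API whose details may differ between Spark versions): the host-to-splits map returned by computePreferredLocations is passed directly as the second constructor argument.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.scheduler.InputFormatInfo

    // Hadoop conf: used to look up where the blocks of the input file live
    val hadoopConf = new Configuration()
    val locData = InputFormatInfo.computePreferredLocations(
      Seq(new InputFormatInfo(hadoopConf, classOf[TextInputFormat], "hdfs:///path/to/myfile.txt")))
    // Spark conf: a different class with a similar name, as noted in the reply above
    val sc = new SparkContext(new SparkConf().setAppName("locality-demo"), locData)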

Re: problem with partitioning

2014-09-28 Thread Sean Owen
(Most of this code is not relevant to the question and could be refactored too. The casts and null checks look unnecessary.) You are unioning RDDs, so you have a result with the sum of their partitions. The number of partitions is really a hint to Hadoop only, so it is not even necessarily 3 x 1920.
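
A minimal sketch of the point above, with hypothetical paths (the 1920 figure comes from the reply): the minPartitions argument to textFile is only a hint, and a union simply concatenates its inputs' partitions, so repartition explicitly if a specific partition count matters downstream.

    val rdd1 = sc.textFile("/path/one", 1920)    // 1920 is a hint, not a guarantee
    val rdd2 = sc.textFile("/path/two", 1920)
    val rdd3 = sc.textFile("/path/three", 1920)
    // The union has the sum of the three inputs' partition counts
    val combined = rdd1.union(rdd2).union(rdd3)
    // Force a specific partition count (this triggers a shuffle)
    val repartitioned = combined.repartition(1920)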

spark multi-node cluster

2014-09-28 Thread codeoedoc
Hi guys, This is a Spark fresh user... I'm trying to set up a Spark cluster with multiple nodes, starting with 2. With one node, it is working fine. When I add a slave node, the slave is able to register with the master node. However, when I launch a spark shell, and when the executor is launched on the

Re: Re: problem with partitioning

2014-09-28 Thread qinwei
Thank you for your reply; your tips on code refactoring are helpful. After a second look at the code, the casts and null checks are really unnecessary. qinwei  From: Sean Owen Date: 2014-09-28 15:03 To: qinwei CC: user Subject: Re: problem with partitioning (Most of this code is not relevant

Re: spark multi-node cluster

2014-09-28 Thread codeoedoc
BTW, I'm using standalone deployment. (The name "standalone deployment" for a cluster is kind of misleading... I think the doc needs to be updated. It's not really standalone, but a plain Spark-only deployment.) Thx, cody On Sun, Sep 28, 2014 at 12:36 AM, codeoedoc codeoe...@gmail.com wrote: Hi

How to do broadcast join in SparkSQL

2014-09-28 Thread Jianshi Huang
I cannot find it in the documentation. And I have a dozen dimension tables to (left) join... Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/

Re: Build spark with Intellij IDEA 13

2014-09-28 Thread Yi Tian
Hi, If you want IDEA to compile your Spark project (version 1.0.0 and above), you should do it with the following steps: 1 clone the spark project 2 use mvn to compile your spark project (because you need the generated avro source file in the flume-sink module) 3 open spark/pom.xml with IDEA 4 check profiles

[MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-28 Thread Yanbo Liang
Hi, We have used LogisticRegression with two different optimization methods, SGD and LBFGS, in MLlib. With the same dataset and the same training and test split, we get different weight vectors. For example, we use spark-1.1.0/data/mllib/sample_binary_classification_data.txt as our training and
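
A minimal sketch of the comparison described above, assuming the bundled sample dataset and the Spark 1.1 MLlib API; the split ratio, seed, and iteration count are illustrative. Differing weights between the two are expected when the solvers stop at different points, since LBFGS typically gets much closer to the optimum than plain SGD for a comparable iteration budget.

    import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
    import org.apache.spark.mllib.util.MLUtils

    // Load the sample data in LIBSVM format and split it into training and test sets
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
    val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 11L)

    val sgdModel = LogisticRegressionWithSGD.train(training, 100)      // 100 SGD iterations
    val lbfgsModel = new LogisticRegressionWithLBFGS().run(training)   // default LBFGS settings

    println("SGD weights:   " + sgdModel.weights)
    println("LBFGS weights: " + lbfgsModel.weights)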

[SF Machine Learning meetup] talk by Prof. C J Lin, large-scale linear classification: status and challenges

2014-09-28 Thread Chester At Work
All, Sorry this is not strictly Spark related, but I thought some of you in San Francisco might be interested in this talk. We announced this talk recently; it will be at the end of next month (Oct). http://www.meetup.com/sfmachinelearning/events/208078582/ Prof CJ Lin is famous for his work on libsvm

Re: How to do broadcast join in SparkSQL

2014-09-28 Thread Ted Yu
Have you looked at SPARK-1800 ? e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala Cheers On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: I cannot find it in the documentation. And I have a dozen dimension tables to (left) join... Cheers,

Re: How to do broadcast join in SparkSQL

2014-09-28 Thread Jianshi Huang
Yes, looks like it can only be controlled by the parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird to me. How am I supposed to know the exact size of a table in bytes? Letting me specify the preferred join algorithm would be better, I think. Jianshi On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu
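
For reference, a minimal sketch of the threshold-based control discussed above (table and column names are hypothetical): any table whose estimated size is below spark.sql.autoBroadcastJoinThreshold bytes may be broadcast to every executor instead of being shuffled.

    // Broadcast any table estimated at under ~100 MB (the value is in bytes)
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)
    val joined = sqlContext.sql(
      "SELECT * FROM fact_table f LEFT OUTER JOIN dim_table d ON f.dim_key = d.dim_key")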

Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-28 Thread Du Li
It turned out to be a bug in my code. In the select clause, the list of fields was misaligned with the schema of the target table. As a consequence, the map data couldn't be cast to some other type in the schema. Thanks anyway. On 9/26/14, 8:08 PM, Cheng Lian lian.cs@gmail.com wrote: Would you mind

driver memory management

2014-09-28 Thread Brad Miller
Hi All, I am interested in calling collect() on a large RDD so that I can run a learning algorithm on it. I've noticed that when I don't increase SPARK_DRIVER_MEMORY I can run out of memory. I've also noticed that it looks like the same fraction of memory is reserved for storage on the driver as on the

view not supported in spark thrift server?

2014-09-28 Thread Du Li
Can anybody confirm whether or not views are currently supported in Spark? I found “create view translate” in the blacklist of HiveCompatibilitySuite.scala, and the following scenario also threw a NullPointerException on beeline/thriftserver (1.1.0). Any plan to support them soon? create table

Re: view not supported in spark thrift server?

2014-09-28 Thread Michael Armbrust
Views are not supported yet. It's not currently on the near-term roadmap, but that can change if there is sufficient demand or someone in the community is interested in implementing them. I do not think it would be very hard. Michael On Sun, Sep 28, 2014 at 11:59 AM, Du Li

Re: view not supported in spark thrift server?

2014-09-28 Thread Du Li
Thanks, Michael, for your quick response. View support is critical for my project, which is migrating from Shark to Spark SQL. I have implemented and tested everything else. It would be perfect if views could be implemented soon. Du From: Michael Armbrust

Spark meetup on Oct 15 in NYC

2014-09-28 Thread Reynold Xin
Hi Spark users and developers, Some of the most active Spark developers (including Matei Zaharia, Michael Armbrust, Joseph Bradley, TD, Paco Nathan, and me) will be in NYC for Strata NYC. We are working with the Spark NYC meetup group and Bloomberg to host a meetup event. This might be the event

Re: driver memory management

2014-09-28 Thread Reynold Xin
The storage fraction only limits the amount of memory used for storage. It doesn't actually limit anything else, i.e. you can use all the memory in collect() if you want. On Sunday, September 28, 2014, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi All, I am interested to collect() a large RDD

Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Haopu Wang
Thanks for the response. From the Spark Web UI's Storage tab, I do see the cached RDD there. But the storage level is "Memory Deserialized 1x Replicated". How can I change the storage level? Because I have a big table there. Thanks! From: Cheng Lian

Re: Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Michael Armbrust
This is not possible until https://github.com/apache/spark/pull/2501 is merged. On Sun, Sep 28, 2014 at 6:39 PM, Haopu Wang hw...@qilinsoft.com wrote: Thanks for the response. From Spark Web-UI's Storage tab, I do see cached RDD there. But the storage level is Memory Deserialized 1x

Re: Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Michael Armbrust
You might consider instead storing the data using saveAsParquetFile and then querying that after running sqlContext.parquetFile(...).registerTempTable(...). On Sun, Sep 28, 2014 at 6:43 PM, Michael Armbrust mich...@databricks.com wrote: This is not possible until
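
A minimal sketch of this workaround with hypothetical table names and paths: write the table out as Parquet once, then register the Parquet file as a table, so subsequent queries read the compact columnar file instead of a deserialized in-memory cache.

    val bigTable = sqlContext.sql("SELECT * FROM big_table")             // hypothetical source table
    bigTable.saveAsParquetFile("/tmp/big_table.parquet")                 // hypothetical path
    sqlContext.parquetFile("/tmp/big_table.parquet").registerTempTable("big_table_pq")
    val counts = sqlContext.sql("SELECT count(*) FROM big_table_pq")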

Re: spark multi-node cluster

2014-09-28 Thread codeoedoc
Figured this out... documented it here, hope it can help others: http://koobehub.wordpress.com/2014/09/29/spark-the-standalone-cluster-deployment/ On Sun, Sep 28, 2014 at 12:36 AM, codeoedoc codeoe...@gmail.com wrote: Hi guys, This is a spark fresh user... I'm trying to setup a spark cluster

Re: Kinesis receiver spark streaming partition

2014-09-28 Thread Wei Liu
Chris, I thought I would check back with you to see if you have made progress on this issue. Any good news so far? Thanks. Once again, I really appreciate you looking into this issue. Thanks, Wei On Thu, Aug 28, 2014 at 4:44 PM, Chris Fregly ch...@fregly.com wrote: great question, wei. this is very

Re: Using one sql query's result inside another sql query

2014-09-28 Thread twinkle sachdeva
Thanks Cheng. For the time being, as a workaround, I applied the schema to queryResult1 and then registered the result as a temp table. Although that works, I was not sure of the performance impact, as that might block some optimizations in some scenarios. This flow (on Spark 1.1) works:

Re: Using one sql query's result inside another sql query

2014-09-28 Thread Cheng Lian
This workaround looks good to me. In this way, all queries are still executed lazily within a single DAG, and Spark SQL is capable of optimizing the query plan as a whole. On 9/29/14 11:26 AM, twinkle sachdeva wrote: Thanks Cheng. For the time being , As a work around, I had applied the schema
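
A minimal sketch of the kind of workaround being discussed, with hypothetical table and column names (this is not the poster's actual code, which is truncated above): apply an explicit schema to the first query's result, register it as a temp table, and run the second query against that.

    import org.apache.spark.sql._

    // First query produces an intermediate result (hypothetical table and columns)
    val queryResult1 = sqlContext.sql("SELECT id, name FROM table1")
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("name", StringType, nullable = true)))
    // Re-attach a schema to the intermediate result and expose it as a table
    sqlContext.applySchema(queryResult1, schema).registerTempTable("result1")
    // The second query refers to the registered intermediate result; as noted above,
    // everything still executes lazily as a single DAG
    val queryResult2 = sqlContext.sql("SELECT name FROM result1 WHERE id = '42'")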