Re: [SparkSQL] Could not alter table in Spark 1.5 using HiveContext

2015-09-11 Thread StanZhai
Thanks a lot! I've fixed this issue by setting: 
spark.sql.hive.metastore.version = 0.13.1
spark.sql.hive.metastore.jars = maven
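
For reference, a minimal sketch of applying the same two settings programmatically
before the HiveContext is created (the app name and the SQL statement are
placeholders, and "maven" can be replaced with a classpath containing the Hive
0.13.1 and Hadoop jars):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class HiveMetastore013Example {
  public static void main(String[] args) {
    // The two metastore settings from this thread must be in the Spark conf
    // before the HiveContext is instantiated.
    SparkConf conf = new SparkConf()
        .setAppName("hive-metastore-0.13.1")                 // placeholder app name
        .set("spark.sql.hive.metastore.version", "0.13.1")
        .set("spark.sql.hive.metastore.jars", "maven");      // or a path to Hive 0.13.1 + Hadoop jars
    JavaSparkContext jsc = new JavaSparkContext(conf);
    HiveContext hiveContext = new HiveContext(jsc.sc());
    hiveContext.sql("ALTER TABLE t ADD COLUMNS (c INT)");    // placeholder statement
    jsc.stop();
  }
}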


Yin Huai-2 wrote
> Yes, Spark 1.5 uses Hive 1.2's metastore client by default. You can change
> it by putting the following settings in your spark conf.
> 
> spark.sql.hive.metastore.version = 0.13.1
> spark.sql.hive.metastore.jars = maven or the path of your hive 0.13 jars
> and hadoop jars
> 
> For spark.sql.hive.metastore.jars: basically, it tells Spark SQL where to
> find the Hive 0.13.1 metastore client's classes. If you set it to maven, we
> will download the needed jars directly (it is an easy way to do testing work).
> 
> On Thu, Sep 10, 2015 at 7:45 PM, StanZhai wrote:
> 
>> Thank you for the swift reply!
>>
>> The version of my Hive metastore server is 0.13.1. I've built Spark using
>> sbt like this:
>> build/sbt -Pyarn -Phadoop-2.4 -Phive -Phive-thriftserver assembly
>>
>> Does Spark 1.5 bind to the Hive 1.2 metastore client by default?



Re: SparkR driver side JNI

2015-09-11 Thread Renyi Xiong
forgot to reply all.

I see. But what prevents, e.g., the R driver from getting those command-line
arguments from spark-submit and setting them with SparkConf on the R driver's
in-process JVM through JNI?

On Thu, Sep 10, 2015 at 9:29 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Yeah in addition to the downside of having 2 JVMs the command line
> arguments and SparkConf etc. will be set by spark-submit in the first
> JVM which won't be available in the second JVM.
>
> Shivaram
>
> On Thu, Sep 10, 2015 at 5:18 PM, Renyi Xiong 
> wrote:
> > for 2nd case where JVM comes up first, we also can launch in-process JNI
> > just like inter-process mode, correct? (difference is that a 2nd JVM gets
> > loaded)
> >
> > On Thu, Aug 6, 2015 at 9:51 PM, Shivaram Venkataraman
> >  wrote:
> >>
> >> The in-process JNI only works out when the R process comes up first
> >> and we launch a JVM inside it. In many deploy modes like YARN (or
> >> actually in anything using spark-submit) the JVM comes up first and we
> >> launch R after that. Using an inter-process solution helps us cover
> >> both use cases
> >>
> >> Thanks
> >> Shivaram
> >>
> >> On Thu, Aug 6, 2015 at 8:33 PM, Renyi Xiong 
> wrote:
> >> > why SparkR chose to uses inter-process socket solution eventually on
> >> > driver
> >> > side instead of in-process JNI showed in one of its doc's below (about
> >> > page
> >> > 20)?
> >> >
> >> >
> >> >
> https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf
> >> >
> >> >
> >
> >
>


Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-11 Thread Reynold Xin
It is already there, but the search is not updated. Not sure what's going
on with maven central search.


http://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.10/1.5.0/



On Fri, Sep 11, 2015 at 10:21 AM, Ryan Williams <
ryan.blake.willi...@gmail.com> wrote:

> Any idea why 1.5.0 is not in Maven central yet? Is that a separate release
> process?
>
>
> On Wed, Sep 9, 2015 at 12:40 PM andy petrella 
> wrote:
>
>> You can try it out really quickly by "building" a Spark Notebook from
>> http://spark-notebook.io/.
>>
>> Just choose the master branch and 1.5.0, a correct hadoop version
>> (default to 2.2.0 though) and there you go :-)
>>
>>
>> On Wed, Sep 9, 2015 at 6:39 PM Ted Yu  wrote:
>>
>>> Jerry:
>>> I just tried building hbase-spark module with 1.5.0 and I see:
>>>
>>> ls -l ~/.m2/repository/org/apache/spark/spark-core_2.10/1.5.0
>>> total 21712
>>> -rw-r--r--  1 tyu  staff   196 Sep  9 09:37 _maven.repositories
>>> -rw-r--r--  1 tyu  staff  11081542 Sep  9 09:37 spark-core_2.10-1.5.0.jar
>>> -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.jar.sha1
>>> -rw-r--r--  1 tyu  staff     19816 Sep  9 09:37 spark-core_2.10-1.5.0.pom
>>> -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.pom.sha1
>>>
>>> FYI
>>>
>>> On Wed, Sep 9, 2015 at 9:35 AM, Jerry Lam  wrote:
>>>
 Hi Spark Developers,

 I'm eager to try it out! However, I got problems in resolving
 dependencies:
 [warn] [NOT FOUND  ]
 org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms)
 [warn]  jcenter: tried

 When will the package be available?

 Best Regards,

 Jerry


 On Wed, Sep 9, 2015 at 9:30 AM, Dimitris Kouzis - Loukas <
 look...@gmail.com> wrote:

> Yeii!
>
> On Wed, Sep 9, 2015 at 2:25 PM, Yu Ishikawa <
> yuu.ishikawa+sp...@gmail.com> wrote:
>
>> Great work, everyone!
>>
>>
>>
>> -
>> -- Yu Ishikawa
>>
>>
>

>>> --
>> andy
>>
>


Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-11 Thread Jonathan Kelly
I just clicked the
http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22 link
provided above by Ryan, and I see 1.5.0. Was this just fixed within the
past hour, or is some caching causing some people not to see it?

On Fri, Sep 11, 2015 at 10:24 AM, Reynold Xin  wrote:

> It is already there, but the search is not updated. Not sure what's going
> on with maven central search.
>
>
> http://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.10/1.5.0/
>
>
>
> On Fri, Sep 11, 2015 at 10:21 AM, Ryan Williams <
> ryan.blake.willi...@gmail.com> wrote:
>
>> Any idea why 1.5.0 is not in Maven central yet? Is that a separate release
>> process?
>>
>>
>> On Wed, Sep 9, 2015 at 12:40 PM andy petrella 
>> wrote:
>>
>>> You can try it out really quickly by "building" a Spark Notebook from
>>> http://spark-notebook.io/.
>>>
>>> Just choose the master branch and 1.5.0, a correct hadoop version
>>> (default to 2.2.0 though) and there you go :-)
>>>
>>>
>>> On Wed, Sep 9, 2015 at 6:39 PM Ted Yu  wrote:
>>>
 Jerry:
 I just tried building hbase-spark module with 1.5.0 and I see:

 ls -l ~/.m2/repository/org/apache/spark/spark-core_2.10/1.5.0
 total 21712
 -rw-r--r--  1 tyu  staff   196 Sep  9 09:37 _maven.repositories
 -rw-r--r--  1 tyu  staff  11081542 Sep  9 09:37 spark-core_2.10-1.5.0.jar
 -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.jar.sha1
 -rw-r--r--  1 tyu  staff     19816 Sep  9 09:37 spark-core_2.10-1.5.0.pom
 -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.pom.sha1

 FYI

 On Wed, Sep 9, 2015 at 9:35 AM, Jerry Lam  wrote:

> Hi Spark Developers,
>
> I'm eager to try it out! However, I got problems in resolving
> dependencies:
> [warn] [NOT FOUND  ]
> org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms)
> [warn]  jcenter: tried
>
> When will the package be available?
>
> Best Regards,
>
> Jerry
>
>
> On Wed, Sep 9, 2015 at 9:30 AM, Dimitris Kouzis - Loukas <
> look...@gmail.com> wrote:
>
>> Yeii!
>>
>> On Wed, Sep 9, 2015 at 2:25 PM, Yu Ishikawa <
>> yuu.ishikawa+sp...@gmail.com> wrote:
>>
>>> Great work, everyone!
>>>
>>>
>>>
>>> -
>>> -- Yu Ishikawa
>>>
>>>
>>
>
 --
>>> andy
>>>
>>
>


Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-11 Thread Ryan Williams
Any idea why 1.5.0 is not in Maven central yet? Is that a separate release
process?

On Wed, Sep 9, 2015 at 12:40 PM andy petrella 
wrote:

> You can try it out really quickly by "building" a Spark Notebook from
> http://spark-notebook.io/.
>
> Just choose the master branch and 1.5.0, a correct hadoop version (default
> to 2.2.0 though) and there you go :-)
>
>
> On Wed, Sep 9, 2015 at 6:39 PM Ted Yu  wrote:
>
>> Jerry:
>> I just tried building hbase-spark module with 1.5.0 and I see:
>>
>> ls -l ~/.m2/repository/org/apache/spark/spark-core_2.10/1.5.0
>> total 21712
>> -rw-r--r--  1 tyu  staff   196 Sep  9 09:37 _maven.repositories
>> -rw-r--r--  1 tyu  staff  11081542 Sep  9 09:37 spark-core_2.10-1.5.0.jar
>> -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.jar.sha1
>> -rw-r--r--  1 tyu  staff     19816 Sep  9 09:37 spark-core_2.10-1.5.0.pom
>> -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.pom.sha1
>>
>> FYI
>>
>> On Wed, Sep 9, 2015 at 9:35 AM, Jerry Lam  wrote:
>>
>>> Hi Spark Developers,
>>>
>>> I'm eager to try it out! However, I got problems in resolving
>>> dependencies:
>>> [warn] [NOT FOUND  ]
>>> org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms)
>>> [warn]  jcenter: tried
>>>
>>> When will the package be available?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>
>>> On Wed, Sep 9, 2015 at 9:30 AM, Dimitris Kouzis - Loukas <
>>> look...@gmail.com> wrote:
>>>
 Yeii!

 On Wed, Sep 9, 2015 at 2:25 PM, Yu Ishikawa <
 yuu.ishikawa+sp...@gmail.com> wrote:

> Great work, everyone!
>
>
>
> -
> -- Yu Ishikawa
>
>

>>>
>> --
> andy
>


Re: MongoDB and Spark

2015-09-11 Thread Sandeep Giri
use map-reduce.

On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek 
wrote:

> Hello ,
>
>
>
> Is there any way to query multiple collections from MongoDB using Spark
> and Java? I want to create only one Configuration object. Please help
> if anyone has something regarding this.
>
>
>
>
>
> Thank You
>
> Abhishek
>


New JavaRDD Inside JavaPairDStream

2015-09-11 Thread Rachana Srivastava
Hello all,

Can we invoke a JavaRDD while processing a stream from Kafka, for example? The
following code is throwing a serialization exception. Not sure if this is
feasible.

JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(5));
JavaPairReceiverInputDStream<String, String> messages =
    KafkaUtils.createStream(jssc, zkQuorum, group, topicMap);
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
  public String call(Tuple2<String, String> tuple2) {
    return tuple2._2();
  }
});
JavaPairDStream<String, Double> wordCounts = lines.mapToPair(
    new PairFunction<String, String, Double>() {
      public Tuple2<String, Double> call(String urlString) {
        String propertiesFile = "/home/cloudera/Desktop/sample/input/featurelist.properties";
        // jsc is the driver-side JavaSparkContext; referencing it inside this
        // closure is the likely source of the serialization exception above
        JavaRDD<String> propertiesFileRDD = jsc.textFile(propertiesFile);
        JavaPairRDD<String, String> featureKeyClassPair = propertiesFileRDD.mapToPair(
            new PairFunction<String, String, String>() {
              public Tuple2<String, String> call(String property) {
                return new Tuple2<String, String>(property.split("=")[0], property.split("=")[1]);
              }
            });
        featureKeyClassPair.count();
        // featureScore is not defined in this snippet; Double is assumed for the value type
        return new Tuple2<String, Double>(urlString, featureScore);
      }
    });
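
The likely cause is that the driver-side JavaSparkContext (jsc) is captured by the
PairFunction closure; a SparkContext is not serializable, and RDDs cannot be created
inside DStream or RDD operations anyway. A minimal sketch of one common workaround is
to load the properties file once on the driver and broadcast the resulting map
(variable names are reused from the snippet above; the Integer value is only a
placeholder for whatever score gets computed):

// needs: import java.util.Map; import org.apache.spark.broadcast.Broadcast;
String propertiesFile = "/home/cloudera/Desktop/sample/input/featurelist.properties";
// build the feature map once on the driver...
final Broadcast<Map<String, String>> featureMap = jsc.broadcast(
    jsc.textFile(propertiesFile)
       .mapToPair(new PairFunction<String, String, String>() {
         public Tuple2<String, String> call(String property) {
           return new Tuple2<String, String>(property.split("=")[0], property.split("=")[1]);
         }
       })
       .collectAsMap());
// ...and use only the broadcast value inside the DStream closure
JavaPairDStream<String, Integer> featureCounts = lines.mapToPair(
    new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String urlString) {
        return new Tuple2<String, Integer>(urlString, featureMap.value().size());
      }
    });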



Re: SparkR driver side JNI

2015-09-11 Thread Renyi Xiong
got it! thanks a lot.

On Fri, Sep 11, 2015 at 11:10 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Its possible -- in the sense that a lot of designs are possible. But
> AFAIK there are no clean interfaces for getting all the arguments /
> SparkConf options from spark-submit and its all the more tricker to
> handle scenarios where the first JVM has already created a
> SparkContext that you want to use from R. The inter-process
> communication is cleaner, pretty lightweight and handles all the
> scenarios.
>
> Thanks
> Shivaram
>
> On Fri, Sep 11, 2015 at 10:54 AM, Renyi Xiong 
> wrote:
> > forgot to reply all.
> >
> > I see. but what prevents e.g. R driver getting those command line
> arguments
> > from spark-submit and setting them with SparkConf to R diver's in-process
> > JVM through JNI?
> >
> > On Thu, Sep 10, 2015 at 9:29 PM, Shivaram Venkataraman
> >  wrote:
> >>
> >> Yeah in addition to the downside of having 2 JVMs the command line
> >> arguments and SparkConf etc. will be set by spark-submit in the first
> >> JVM which won't be available in the second JVM.
> >>
> >> Shivaram
> >>
> >> On Thu, Sep 10, 2015 at 5:18 PM, Renyi Xiong 
> >> wrote:
> >> > for 2nd case where JVM comes up first, we also can launch in-process
> JNI
> >> > just like inter-process mode, correct? (difference is that a 2nd JVM
> >> > gets
> >> > loaded)
> >> >
> >> > On Thu, Aug 6, 2015 at 9:51 PM, Shivaram Venkataraman
> >> >  wrote:
> >> >>
> >> >> The in-process JNI only works out when the R process comes up first
> >> >> and we launch a JVM inside it. In many deploy modes like YARN (or
> >> >> actually in anything using spark-submit) the JVM comes up first and
> we
> >> >> launch R after that. Using an inter-process solution helps us cover
> >> >> both use cases
> >> >>
> >> >> Thanks
> >> >> Shivaram
> >> >>
> >> >> On Thu, Aug 6, 2015 at 8:33 PM, Renyi Xiong 
> >> >> wrote:
> >> >> > why SparkR chose to uses inter-process socket solution eventually
> on
> >> >> > driver
> >> >> > side instead of in-process JNI showed in one of its doc's below
> >> >> > (about
> >> >> > page
> >> >> > 20)?
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>


SIGTERM 15 Issue : Spark Streaming for ingesting huge text files using custom Receiver

2015-09-11 Thread Varadhan, Jawahar
Hi all,

I have coded a custom receiver which receives Kafka messages. These Kafka messages
have FTP server credentials in them. The receiver opens the message and uses the FTP
credentials in it to connect to the FTP server. It then streams a huge text file
(3.3 GB). The stream is read line by line using a buffered reader and pushed to Spark
Streaming via the receiver's "store" method. The Spark Streaming process receives all
these lines and stores them in HDFS.

With this process I can ingest small files (50 MB) but can't ingest the 3.3 GB file.
I get a YARN exception of SIGTERM 15 in the Spark Streaming process. Also, I tried
pointing at that 3.3 GB file directly (without the custom receiver) in Spark Streaming
using ssc.textFileStream, and everything works fine and the file ends up in HDFS.

Please let me know what I might have to do to get this working with the receiver. I
know there are better ways to ingest the file, but we need to use Spark Streaming in
our case.
Thanks.
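
For reference, a minimal sketch of the kind of receiver described above, assuming
Apache commons-net for the FTP connection; the host, credentials, and file path are
placeholders (in the setup above they would come from the Kafka message), and error
handling is kept minimal:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

public class FtpLineReceiver extends Receiver<String> {
  // Placeholders; in the setup described above these come from the Kafka message.
  private final String host = "ftp.example.com";
  private final String user = "user";
  private final String password = "secret";
  private final String path = "/data/huge-file.txt";

  public FtpLineReceiver() {
    super(StorageLevel.MEMORY_AND_DISK_SER());
  }

  @Override
  public void onStart() {
    // read on a separate thread so onStart() returns immediately
    new Thread("ftp-line-receiver") {
      @Override
      public void run() {
        receive();
      }
    }.start();
  }

  @Override
  public void onStop() {
    // the reading thread checks isStopped() and exits on its own
  }

  private void receive() {
    try {
      FTPClient ftp = new FTPClient();
      ftp.connect(host);
      ftp.login(user, password);
      ftp.enterLocalPassiveMode();
      InputStream in = ftp.retrieveFileStream(path);
      BufferedReader reader = new BufferedReader(new InputStreamReader(in));
      String line;
      // stream the file line by line into Spark Streaming via store()
      while (!isStopped() && (line = reader.readLine()) != null) {
        store(line);
      }
      reader.close();
      ftp.completePendingCommand();
      ftp.logout();
      ftp.disconnect();
    } catch (Exception e) {
      restart("Error streaming file over FTP", e);
    }
  }
}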

Re: Concurrency issue in SQLExecution.withNewExecutionId

2015-09-11 Thread Olivier Toupin
@Andrew_Or-2 I am using Scala futures.






Re: MongoDB and Spark

2015-09-11 Thread Sandeep Giri
I think it should be possible by loading the collections as RDDs and then doing
a union on them.
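
A minimal sketch of that approach, assuming the mongo-hadoop connector (the
database/collection URIs and the master are placeholders; note that this still
uses one Hadoop Configuration per collection):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;

public class MongoUnionExample {
  // Loads one collection as an RDD of (id, BSONObject) pairs.
  private static JavaPairRDD<Object, BSONObject> load(JavaSparkContext sc, String uri) {
    Configuration conf = new Configuration();
    conf.set("mongo.input.uri", uri); // e.g. mongodb://host:27017/db.collection
    return sc.newAPIHadoopRDD(conf, MongoInputFormat.class, Object.class, BSONObject.class);
  }

  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[*]", "mongo-union"); // placeholder master/app
    JavaPairRDD<Object, BSONObject> collectionA =
        load(sc, "mongodb://localhost:27017/mydb.collectionA"); // placeholder URI
    JavaPairRDD<Object, BSONObject> collectionB =
        load(sc, "mongodb://localhost:27017/mydb.collectionB"); // placeholder URI
    JavaPairRDD<Object, BSONObject> both = collectionA.union(collectionB);
    System.out.println("total documents: " + both.count());
    sc.stop();
  }
}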

Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com. 
Phone: +1-253-397-1945 (Office)



On Fri, Sep 11, 2015 at 3:40 PM, Mishra, Abhishek  wrote:

> Anything using Spark RDD’s ???
>
>
>
> Abhishek
>
>
>
> *From:* Sandeep Giri [mailto:sand...@knowbigdata.com]
> *Sent:* Friday, September 11, 2015 3:19 PM
> *To:* Mishra, Abhishek; u...@spark.apache.org; dev@spark.apache.org
> *Subject:* Re: MongoDB and Spark
>
>
>
> use map-reduce.
>
>
>
> On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek 
> wrote:
>
> Hello ,
>
>
>
> Is there any way to query multiple collections from MongoDB using Spark
> and Java? I want to create only one Configuration object. Please help
> if anyone has something regarding this.
>
>
>
>
>
> Thank You
>
> Abhishek
>
>


Re: Spark 1.5.x: Java files in src/main/scala and vice versa

2015-09-11 Thread lonikar
It does not cause any problem when building with Maven. But when doing
eclipse:eclipse, the generated .classpath files contained classpathentry
entries only for the Java sources. This caused all the .scala sources to be
ignored and led to all kinds of Eclipse build errors. It was resolved only when
I added prebuilt jars to the Java build path, and it also prevented me from
debugging Spark code.

I understand eclipse:eclipse is not the recommended way of creating Eclipse
projects, but that's how I noticed this issue.

As Sean said, it's a matter of code organization, and it's confusing to find
Java files in src/main/scala. In my environment, I moved the files and did not
notice any issues. Unless there is a specific purpose, it would be better if
the code were reorganized.






Re: SparkR driver side JNI

2015-09-11 Thread Shivaram Venkataraman
It's possible -- in the sense that a lot of designs are possible. But
AFAIK there are no clean interfaces for getting all the arguments /
SparkConf options from spark-submit, and it's all the trickier to
handle scenarios where the first JVM has already created a
SparkContext that you want to use from R. The inter-process
communication is cleaner, pretty lightweight, and handles all the
scenarios.

Thanks
Shivaram

On Fri, Sep 11, 2015 at 10:54 AM, Renyi Xiong  wrote:
> forgot to reply all.
>
> I see. but what prevents e.g. R driver getting those command line arguments
> from spark-submit and setting them with SparkConf to R diver's in-process
> JVM through JNI?
>
> On Thu, Sep 10, 2015 at 9:29 PM, Shivaram Venkataraman
>  wrote:
>>
>> Yeah in addition to the downside of having 2 JVMs the command line
>> arguments and SparkConf etc. will be set by spark-submit in the first
>> JVM which won't be available in the second JVM.
>>
>> Shivaram
>>
>> On Thu, Sep 10, 2015 at 5:18 PM, Renyi Xiong 
>> wrote:
>> > for 2nd case where JVM comes up first, we also can launch in-process JNI
>> > just like inter-process mode, correct? (difference is that a 2nd JVM
>> > gets
>> > loaded)
>> >
>> > On Thu, Aug 6, 2015 at 9:51 PM, Shivaram Venkataraman
>> >  wrote:
>> >>
>> >> The in-process JNI only works out when the R process comes up first
>> >> and we launch a JVM inside it. In many deploy modes like YARN (or
>> >> actually in anything using spark-submit) the JVM comes up first and we
>> >> launch R after that. Using an inter-process solution helps us cover
>> >> both use cases
>> >>
>> >> Thanks
>> >> Shivaram
>> >>
>> >> On Thu, Aug 6, 2015 at 8:33 PM, Renyi Xiong 
>> >> wrote:
>> >> > why SparkR chose to uses inter-process socket solution eventually on
>> >> > driver
>> >> > side instead of in-process JNI showed in one of its doc's below
>> >> > (about
>> >> > page
>> >> > 20)?
>> >> >
>> >> >
>> >> >
>> >> > https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf
>> >> >
>> >> >
>> >
>> >
>
>




Multithreaded vs Spark Executor

2015-09-11 Thread Rachana Srivastava
Hello all,

We are getting a stream of input data from a Kafka queue using the Spark Streaming
API. For each data element we want to run parallel threads to process a set of
feature lists (nearly 100 features or more). Since feature-list creation is
independent for each feature, we would like to execute these feature lists in
parallel on the input data that we get from the Kafka queue.

The question is:

1. Should we write a thread pool and manage the feature execution on different
threads in parallel? The only concern is that, because of data locality, we are
confined to the node that is assigned the input data from the Kafka stream, so we
cannot leverage distributed nodes to process these features for a single input
record.

2. Or, since we are using a JavaRDD as the feature list, will the feature execution
be managed internally by Spark executors?

Thanks,

Rachana
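
For option 1, a minimal sketch of running the independent feature computations for a
single record on a bounded local thread pool; the Feature interface, the scoring
logic, and the pool size are placeholders. Called from inside a map or mapPartitions
function, the pool runs inside a single executor task, so all threads stay on the
node that received the record (the trade-off noted above):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFeatures {
  // Hypothetical interface standing in for one of the ~100 independent feature computations.
  public interface Feature {
    double score(String record);
  }

  // Runs all features for one record on a bounded thread pool and returns their scores.
  public static List<Double> scoreAll(final String record, List<Feature> features, int poolSize)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    try {
      List<Callable<Double>> tasks = new ArrayList<Callable<Double>>();
      for (final Feature f : features) {
        tasks.add(new Callable<Double>() {
          public Double call() {
            return f.score(record);
          }
        });
      }
      List<Double> scores = new ArrayList<Double>();
      for (Future<Double> result : pool.invokeAll(tasks)) {
        scores.add(result.get());
      }
      return scores;
    } finally {
      pool.shutdown();
    }
  }
}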


Re: MongoDB and Spark

2015-09-11 Thread Corey Nolet
Unfortunately, MongoDB does not directly expose its locality via its client
API, so the problem with trying to schedule Spark tasks against it is that
the tasks themselves cannot be scheduled locally on the nodes containing the
query results, which means you can only assume most results will be sent over
the network to the task that needs to process them. This is bad. The other
reason (which is also related to the issue of locality) is that I'm not sure
there's an easy way to spread the results of a query over multiple different
clients, so you'd probably have to start your Spark RDD with a single partition
and then repartition. What you've done at that point is you've taken data from
multiple MongoDB nodes and collected it on a single node just to re-partition
it, again across the network, onto multiple nodes. This is also bad.

I think this is the reason it was recommended to use MongoDB's map-reduce,
because it can use its locality information internally. I had this same
issue with Couchbase a couple of years back; it's unfortunate, but it's the
reality.




On Fri, Sep 11, 2015 at 9:34 AM, Sandeep Giri 
wrote:

> I think it should be possible by loading collections as RDD and then doing
> a union on them.
>
> Regards,
> Sandeep Giri,
> +1 347 781 4573 (US)
> +91-953-899-8962 (IN)
>
> www.KnowBigData.com. 
> Phone: +1-253-397-1945 (Office)
>
>
>
> On Fri, Sep 11, 2015 at 3:40 PM, Mishra, Abhishek <
> abhishek.mis...@xerox.com> wrote:
>
>> Anything using Spark RDD’s ???
>>
>>
>>
>> Abhishek
>>
>>
>>
>> *From:* Sandeep Giri [mailto:sand...@knowbigdata.com]
>> *Sent:* Friday, September 11, 2015 3:19 PM
>> *To:* Mishra, Abhishek; u...@spark.apache.org; dev@spark.apache.org
>> *Subject:* Re: MongoDB and Spark
>>
>>
>>
>> use map-reduce.
>>
>>
>>
>> On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek 
>> wrote:
>>
>> Hello ,
>>
>>
>>
>> Is there any way to query multiple collections from MongoDB using Spark
>> and Java? I want to create only one Configuration object. Please help
>> if anyone has something regarding this.
>>
>>
>>
>>
>>
>> Thank You
>>
>> Abhishek
>>
>>
>


Re: Spark 1.5: How to trigger expression execution through UnsafeRow/TungstenProject

2015-09-11 Thread lonikar
Thanks, that worked!






Spark 1.5.0: setting up debug env

2015-09-11 Thread lonikar
I have set up a Spark debug environment on Windows and Mac, and thought it
worth sharing given some of the issues I encountered; the instructions given
here did not work for *eclipse* (possibly outdated now). The first step
"sbt/sbt" or "build/sbt" hangs while downloading sbt with the message "Getting
org.scala-sbt sbt 0.13.7 ...". I tried the alternative "build/mvn
eclipse:eclipse", but that too failed, as the generated .classpath files
contained classpathentry entries only for the Java files.

1. Build spark using maven on command line. This will download all the
necessary jars from maven repos and speed up eclipse build. Maven 3.3.3 is
required. Spark ships with it. Just use build/mvn and ensure that there is
no "mvn" command in PATH (build/mvn -Pyarn -Phadoop-2.6
-Dhadoop.version=2.6.0 -DskipTests clean package).
2. Download latest scala-ide (4.1.1 as of now) from http://scala-ide.org
3. Check if the eclipse scala maven plugin is installed. If not, install it:
Help --> Install New Software -->
http://alchim31.free.fr/m2e-scala/update-site/ which is sourced from
https://github.com/sonatype/m2eclipse-scala.
4. If using scala 2.10, add installation 2.10.4. If you build Spark using the
steps described here (build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0
-DskipTests clean package), it gets installed in build/scala-2.10.4. In Eclipse
Preferences -> Scala -> Installations -> Add, specify the build/scala-2.10.4/lib
directory under the Spark source root.
5. In Eclipse -> Project, disable Build Automatically. This is to avoid
building projects till all projects are imported and some settings are
changed. Otherwise, eclipse takes up hours building projects while in half
baked state.
6. In Eclipse -> Preferences -> Java -> Compiler -> Errors/Warnings -->
Deprecated and Restricted API, change the setting to Warning from earlier
Error. This is to take care of Unsafe classes for project tungsten.
7. Import maven projects: In eclipse, File --> Import --> Maven --> Existing
Maven Projects (*not General --> Existing projects in workspace*).
8. After the projects are completely imported, select all projects except
java8-tests_2.10, spark-assembly_2.10, and spark-parent_2.10, right-click, and
choose Scala -> Set the Scala Installation. Choose 2.10.4. This step is also
described here. It does not work for some projects; for those, right-click each
project, Properties -> Scala Compiler -> check Use Project Settings, select the
scala 2.10.4 installation, and click OK.
9. Some projects will give the error "Plugin execution not covered by
lifecycle configuration" when building. The issue is described here. The
pom.xml of those projects needs a <pluginManagement> ... </pluginManagement>
wrapper around its <plugins> section.
 The projects which need this change are spark-streaming-flume-sink_2.10
(external/flume-sink/pom.xml), spark-repl_2.10 (repl/pom.xml),
spark-sql_2.10 (sql/pom.xml), spark-hive_2.10 (sql/hive/pom.xml),
spark-hive-thriftserver_2.10 (sql/hive-thriftserver_2.10/pom.xml),
spark-unsafe_2.10 (unsafe/pom.xml).
10. Right click on project spark-streaming-flume-sink_2.10, Properties ->
Java Build Path -> Source -> Add Folder. Navigate to target -> scala-2.10 ->
src_managed -> main -> compiled_avro. Check the checkbox and click OK.
11. Now enable Project -> Build Automatically. Sit back and relax. If build
fails for some projects (SBT crashes sometimes), just select those, Project
-> Clean -> Clean selected projects.
12. After the build completes (hopefully without any errors), run/debug an
example from spark-examples_2.10. You should be able to put breakpoints in
Spark code and debug. You may have to change the source of the example to add
.setMaster("local") on the val sparkConf line. After this minor change, it will
work. Also, the first time you debug, it will ask you to specify the source
path. Just select Add -> Java Project -> select all Spark projects. Let the
first debugging session complete, as it will not show any Spark code; you may
disable breakpoints in this session to let it go. Subsequent sessions allow you
to walk through Spark code step by step. Enjoy!

You may not have to go through all this if using Scala 2.11 or IntelliJ. But if
you are like me, using Eclipse and Spark's current Scala 2.10.4, you will find
this useful and avoid a lot of googling.

The one issue I encountered is debugging/setting breakpoints in the generated
expression Java code. This code is generated as a string in spark-catalyst_2.10,
in org.apache.spark.sql.catalyst.expressions and
org.apache.spark.sql.catalyst.expressions.codegen. If anyone has figured out how
to do it, please update this thread.




Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-11 Thread Ted Yu
This is related:
https://issues.apache.org/jira/browse/SPARK-10557

On Fri, Sep 11, 2015 at 10:21 AM, Ryan Williams <
ryan.blake.willi...@gmail.com> wrote:

> Any idea why 1.5.0 is not in Maven central yet? Is that a separate release
> process?
>
>
> On Wed, Sep 9, 2015 at 12:40 PM andy petrella 
> wrote:
>
>> You can try it out really quickly by "building" a Spark Notebook from
>> http://spark-notebook.io/.
>>
>> Just choose the master branch and 1.5.0, a correct hadoop version
>> (default to 2.2.0 though) and there you go :-)
>>
>>
>> On Wed, Sep 9, 2015 at 6:39 PM Ted Yu  wrote:
>>
>>> Jerry:
>>> I just tried building hbase-spark module with 1.5.0 and I see:
>>>
>>> ls -l ~/.m2/repository/org/apache/spark/spark-core_2.10/1.5.0
>>> total 21712
>>> -rw-r--r--  1 tyu  staff   196 Sep  9 09:37 _maven.repositories
>>> -rw-r--r--  1 tyu  staff  11081542 Sep  9 09:37 spark-core_2.10-1.5.0.jar
>>> -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.jar.sha1
>>> -rw-r--r--  1 tyu  staff     19816 Sep  9 09:37 spark-core_2.10-1.5.0.pom
>>> -rw-r--r--  1 tyu  staff        41 Sep  9 09:37 spark-core_2.10-1.5.0.pom.sha1
>>>
>>> FYI
>>>
>>> On Wed, Sep 9, 2015 at 9:35 AM, Jerry Lam  wrote:
>>>
 Hi Spark Developers,

 I'm eager to try it out! However, I got problems in resolving
 dependencies:
 [warn] [NOT FOUND  ]
 org.apache.spark#spark-core_2.10;1.5.0!spark-core_2.10.jar (0ms)
 [warn]  jcenter: tried

 When will the package be available?

 Best Regards,

 Jerry


 On Wed, Sep 9, 2015 at 9:30 AM, Dimitris Kouzis - Loukas <
 look...@gmail.com> wrote:

> Yeii!
>
> On Wed, Sep 9, 2015 at 2:25 PM, Yu Ishikawa <
> yuu.ishikawa+sp...@gmail.com> wrote:
>
>> Great work, everyone!
>>
>>
>>
>> -
>> -- Yu Ishikawa
>>
>>
>

>>> --
>> andy
>>
>