Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-04-07 Thread Wojciech Indyk
Hello Divya!
Have you solved the problem?
I suppose this log comes from the driver. You also need to look at the logs on
the worker JVMs; there may be an exception or something there.
Do you have Kerberos on your cluster? If so, it could be similar to the problem in
http://issues.apache.org/jira/browse/SPARK-14115

Based on your logs:
> 16/02/29 23:09:34 INFO ClientCnxn: Opening socket connection to server
> localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL
> (unknown error)
> 16/02/29 23:09:34 INFO ClientCnxn: Socket connection established to
> localhost/0:0:0:0:0:0:0:1:2181, initiating session
> 16/02/29 23:09:34 INFO ClientCnxn: Session establishment complete on
> server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x3532fb70ba20035,

Maybe there is a problem with the RPC calls to the region servers going over
IPv6 (but that is just a guess).

--
Kind regards/ Pozdrawiam,
Wojciech Indyk
http://datacentric.pl


2016-03-01 5:27 GMT+01:00 Divya Gehlot :
> Hi,
> I am getting an error when I try to connect to a Hive table (which was
> created through HBaseIntegration) from Spark
>
> Steps I followed :
> *Hive Table creation code  *:
> CREATE EXTERNAL TABLE IF NOT EXISTS TEST(NAME STRING,AGE INT)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,0:AGE")
> TBLPROPERTIES ("hbase.table.name" = "TEST",
> "hbase.mapred.output.outputtable" = "TEST");
>
>
> *DESCRIBE TEST ;*
> col_name    data_type    comment
> name        string       from deserializer
> age         int          from deserializer
>
>
> *Spark Code :*
> import org.apache.spark._
> import org.apache.spark.sql._
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> hiveContext.sql("from TEST SELECT  NAME").collect.foreach(println)
>
>
> *Starting Spark shell*
> spark-shell --jars
> /usr/hdp/2.3.4.0-3485/hive/lib/guava-14.0.1.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-client.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-common.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-protocol.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hive/lib/htrace-core-3.1.0-incubating.jar,/usr/hdp/2.3.4.0-3485/hive/lib/zookeeper-3.4.6.2.3.4.0-3485.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar
> --driver-class-path
> /usr/hdp/2.3.4.0-3485/hive/lib/guava-14.0.1.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-client.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-common.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-protocol.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hive/lib/htrace-core-3.1.0-incubating.jar,/usr/hdp/2.3.4.0-3485/hive/lib/zookeeper-3.4.6.2.3.4.0-3485.jar,/usr/hdp/2.3.4.0-3485/hive/lib/hive-hbase-handler.jar,/usr/hdp/2.3.4.0-3485/hbase/lib/hbase-server.jar
> --packages com.databricks:spark-csv_2.10:1.3.0  --master yarn-client -i
> /TestDivya/Spark/InstrumentCopyToHDFSHive.scala
>
> *Stack Trace* :
>
> SQL context available as sqlContext.
>> Loading /TestDivya/Spark/InstrumentCopyToHDFSHive.scala...
>> import org.apache.spark._
>> import org.apache.spark.sql._
>> 16/02/29 23:09:29 INFO HiveContext: Initializing execution hive, version
>> 1.2.1
>> 16/02/29 23:09:29 INFO ClientWrapper: Inspected Hadoop version:
>> 2.7.1.2.3.4.0-3485
>> 16/02/29 23:09:29 INFO ClientWrapper: Loaded
>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
>> 2.7.1.2.3.4.0-3485
>> 16/02/29 23:09:29 INFO HiveContext: default warehouse location is
>> /user/hive/warehouse
>> 16/02/29 23:09:29 INFO HiveContext: Initializing HiveMetastoreConnection
>> version 1.2.1 using Spark classes.
>> 16/02/29 23:09:29 INFO ClientWrapper: Inspected Hadoop version:
>> 2.7.1.2.3.4.0-3485
>> 16/02/29 23:09:29 INFO ClientWrapper: Loaded
>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
>> 2.7.1.2.3.4.0-3485
>> 16/02/29 23:09:30 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 16/02/29 23:09:30 INFO metastore: Trying to connect to metastore with URI
>> thrift://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:9083
>> 16/02/29 23:09:30 INFO metastore: Connected to metastore.
>> 16/02/29 23:09:30 WARN DomainSocketFactory: The short-circuit local reads
>> feature cannot be used because libhadoop cannot be loaded.
>> 16/02/29 23:09:31 INFO SessionState: Created local directory:
>> /tmp/1bf53785-f7c8-406d-a733-a5858ccb2d16_resources
>> 16/02/29 23:09:31 INFO SessionState: Created HDFS directory:
>> /tmp/hive/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16
>> 16/02/29 23:09:31 INFO SessionState: Created local directory:
>> /tmp/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16
>> 16/02/29 23:09:31 INFO SessionState: Created HDFS directory:
>> /tmp/hive/hdfs/1bf53785-f7c8-406d-a733-a5858ccb2d16/_tmp_space.db
>> hiveContext: org.apache.spark.sql.hiv

About nested RDD

2016-04-07 Thread Tenghuan He
Hi all,

I know that nested RDDs are not possible, like rdd1.map(x => x +
rdd2.count()).
I tried to create a custom RDD like the following:

class MyRDD[K, V](base: RDD[(K, V)], part: Partitioner) extends RDD[(K, V)] {

  // buffer of internal RDDs this RDD draws its data from
  var rdds = ArrayBuffer.empty[RDD[(K, (V, Int))]]
  def update(rdd: RDD[(K, (V, Int))]) {
    rdds += rdd
  }
  def compute ...
  def getPartitions ...
}

In the compute method I call the internal RDDs' iterators and get a
NullPointerException.
Is this also a form of nested RDDs, and how do I get rid of it?
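
For the simple rdd1/rdd2 example at the top I know the usual workaround:
compute the inner value on the driver first and capture only that value - a
sketch:

// Nested RDD operations fail because RDD objects only live on the driver,
// so rdd2 is not usable inside a closure running on the executors.
val n = rdd2.count()               // action runs on the cluster, result returns to the driver
val result = rdd1.map(x => x + n)  // n is a plain Long, safely captured by the closure

But my real question is about the custom RDD above.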

Thanks.


Tenghuan


how to use udf in spark thrift server.

2016-04-07 Thread zhanghn
I want to define some UDFs in my Spark environment

and serve them through the Thrift server, so I can use these UDFs in my
beeline connection.

At first I tried starting the server with the UDF jars and creating the functions in Hive.

 

In spark-sql, I can add temporary functions like "CREATE TEMPORARY FUNCTION
bsdUpper AS 'org.hue.udf.MyUpper';", and that works well.

But when I add a function like "CREATE FUNCTION bsdupperlong AS
'org.hue.udf.MyUpper' USING JAR 'hdfs://ns1/bsdtest2/myudfs.jar';",

the command completes and I can see the function in the metastore.

But when I use "bsdupperlong" as I expected, it turns out
Spark cannot find the function.

When I add the function again, it throws an exception saying the function
already exists.

 

Then I found this unresolved issue:

https://issues.apache.org/jira/browse/SPARK-11609

So it's a bug that has not been fixed yet.

 

Then I found this:

https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCADONuiR
elyy7ebcj6qr3guo023upmvnt_d_gadrnbavhpor...@mail.gmail.com%3E

It tells me that I can register the UDFs in the context and then start the
Thrift Server using startWithContext from the spark shell.

But I could not find, via Google, an article showing how starting the Thrift
Server with startWithContext from the spark shell is done.
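
From what I can piece together, the pattern would look roughly like the
following when run from the spark shell - a sketch only, I have not been able
to verify it on 1.5.2 (the function and class names are the ones from my
example, and the hive-thriftserver module needs to be on the classpath):

// Sketch: start spark-shell with the UDF jar (--jars .../myudfs.jar),
// register the function on the HiveContext, then expose that same context
// through the Thrift server so beeline sessions can see it.
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// in spark-shell the provided sqlContext is already a HiveContext when Spark is built with Hive
val hiveContext = sqlContext.asInstanceOf[HiveContext]
hiveContext.sql("CREATE TEMPORARY FUNCTION bsdUpper AS 'org.hue.udf.MyUpper'")
HiveThriftServer2.startWithContext(hiveContext)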

 

Can anybody help me with that?

 

Or suggest some other solution, so that I can use UDFs from my beeline client.

 

Thanks a lot.

 

Spark version: spark-1.5.2

Hadoop 2.7.1 

Hive 1.2.1 

 

The UDF code in the jar is like this:

package org.hue.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import java.util.UUID;

public final class MyUUID extends UDF {

  public String evaluate() {
      UUID uuid = UUID.randomUUID();
      return uuid.toString();
  }

}

 

 



[HELP:]Save Spark Dataframe in Phoenix Table

2016-04-07 Thread Divya Gehlot
Hi,
I have a Hortonworks Hadoop cluster with the following configuration:
Spark 1.5.2
HBASE 1.1.x
Phoenix 4.4

I am able to connect to Phoenix through a JDBC connection and read the
Phoenix tables.
But while writing data back to a Phoenix table
I get the error below:

org.apache.spark.sql.AnalysisException:
org.apache.phoenix.spark.DefaultSource does not allow user-specified
schemas.;

Can anybody help resolve the above error, or suggest any other way of
saving Spark DataFrames to Phoenix?
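
For reference, the phoenix-spark documentation describes a DataFrame save
path along these lines (for Spark 1.4+) - a sketch only, with a placeholder
table name and ZooKeeper URL; the target table must already exist in Phoenix
and no schema is passed on the write:

import org.apache.spark.sql.SaveMode

// Sketch following the phoenix-spark docs: let the connector map the
// DataFrame columns onto the existing Phoenix table instead of supplying a schema.
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "MY_PHOENIX_TABLE")   // placeholder table name
  .option("zkUrl", "zkhost:2181")        // placeholder ZooKeeper quorum
  .save()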

Would really appreciate the help.

Thanks,
Divya


Re: Running Spark on Yarn-Client/Cluster mode

2016-04-07 Thread ashesh_28
Hi,

I am also attaching a screenshot of my ResourceManager UI, which shows the
available cores and memory allocated for each node.

 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-on-Yarn-Client-Cluster-mode-tp26691p26710.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Running Spark on Yarn-Client/Cluster mode

2016-04-07 Thread ashesh_28
Hi guys,

Thanks for your valuable input. I have tried a few alternatives as suggested,
but they all lead me to the same result - unable to start the Spark context.

@Dhiraj Peechara
I am able to start my SparkContext in stand-alone mode by just
issuing the *$spark-shell* command from the terminal, so it makes me
believe that HADOOP_CONF_DIR is set correctly. But just for
confirmation I have double-checked it, and the variable is correctly
pointing to the installed path. I am attaching the content of my spark-env.sh
file. Let me know if you think something needs to be modified to get it all
right.
spark-env.txt
 
 

@jasmine

I did try to include the host in the spark-assembly.jar path, but it did not
solve the problem; it does give a different error now. I have also tried to
set the SPARK_JAR variable in the spark-env.sh file, but with no success. I
also tried using the command below:

*spark-shell --master yarn-client --conf
spark.yarn.jar=hdfs://ptfhadoop01v:8020/user/spark/share/lib/spark-assembly.jar*

Issuing this command gives me the following error message:
Spark-Error.txt

  

I have not set up anything in my *spark-defaults.conf* file; I am not sure
if that is mandatory to make it all work. I can confirm that my YARN daemons,
namely the ResourceManager and NodeManagers, are running in the cluster.

I am also attaching a copy of my *yarn-site.xml* just to make sure it is all
correct and not missing any required property.

yarn-site.txt
 
 

I hope I can get over this soon. Thanks again, guys, for your quick thoughts
on this issue.

Regards
Ashesh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-on-Yarn-Client-Cluster-mode-tp26691p26709.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is Hive CREATE DATABASE IF NOT EXISTS atomic

2016-04-07 Thread Mich Talebzadeh
If you are using hiveContext to create a Hive database it will work.

In general you should use Hive to create a Hive database and create tables
within the already existing Hive database from Spark.

Make sure that you qualify the table name with the database name, as below:

sql("DROP TABLE IF EXISTS accounts.ll_18740868")
var sqltext : String = ""
sqltext = """
CREATE TABLE accounts.ll_18740868 (
TransactionDate          DATE
,TransactionType         String
,SortCode                String
,AccountNumber           String
,TransactionDescription  String
,DebitAmount             Double
,CreditAmount            Double
,Balance                 Double
)
COMMENT 'from csv file from excel sheet'
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )
"""
sql(sqltext)
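
And since the original question was about the database itself, the same
hiveContext route covers that too - a minimal sketch:

// Sketch: create the Hive database through the HiveContext before creating tables in it.
sql("CREATE DATABASE IF NOT EXISTS accounts")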


HTH


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 7 April 2016 at 23:23, antoniosi  wrote:

> Hi,
>
> I am using hiveContext.sql("create database if not exists ") to
> create a hive db. Is this statement atomic?
>
> Thanks.
>
> Antonio.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-Hive-CREATE-DATABASE-IF-NOT-EXISTS-atomic-tp26706.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


MLlib ALS MatrixFactorizationModel.save fails consistently

2016-04-07 Thread Colin Woodbury
Hi all,

I've implemented most of a content recommendation system for a client.
However, whenever I attempt to save a MatrixFactorizationModel I've
trained, I see one of four outcomes:

1. Despite "save" being wrapped in a "try" block, I see a massive stack
trace quoting some java.io classes. The Model isn't written.
2. Same as the above, but the Model *is* written. It's unusable however, as
it's missing many of the files it should have, particularly in the
"product" folder.
3. Same as the above, but sbt crashes completely.
4. No massive stack trace, and the Model seems to be written. Upon being
loaded by another Spark context and fed a user ID, it claims the user isn't
present in the Model.

Case 4 is pretty rare. I see these failures both locally and when I test on
a Google Cloud instance with much better resources.

Note that `ALS.trainImplicit` and `model.save` are being called from within
a Future. Could it be possible that Play threads are closing before Spark
can finish, thus interrupting it somehow?

We are running Spark 1.6.1 within Play 2.4 and Scala 2.11. All these
failures have occurred while in Play's Dev mode in SBT.

Thanks for any insight you can give.


Re: ordering over structs

2016-04-07 Thread Imran Akbar
thanks Michael,


I'm trying to implement the code in pyspark like so (where my dataframe has
3 columns - customer_id, dt, and product):

st = StructType().add("dt", DateType(), True).add("product", StringType(),
True)

top = data.select("customer_id", st.alias('vs'))
  .groupBy("customer_id")
  .agg(max("dt").alias("vs"))
  .select("customer_id", "vs.dt", "vs.product")

But I get an error saying:

AttributeError: 'StructType' object has no attribute 'alias'

Can I do this without aliasing the struct?  Or am I doing something
incorrectly?


regards,

imran

On Wed, Apr 6, 2016 at 4:16 PM, Michael Armbrust 
wrote:

> Ordering for a struct goes in order of the fields.  So the max struct is
>> the one with the highest TotalValue (and then the highest category
>>   if there are multiple entries with the same hour and total value).
>>
>> Is this due to "InterpretedOrdering" in StructType?
>>
>
> That is one implementation, but the code generated ordering also follows
> the same contract.
>
>
>
>>  4)  Is it faster doing it this way than doing a join or window function
>> in Spark SQL?
>>
>> Way faster.  This is a very efficient way to calculate argmax.
>>
>> Can you explain how this is way faster than window function? I can
>> understand join doesn't make sense in this case. But to calculate the
>> grouping max, you just have to shuffle the data by grouping keys. You maybe
>> can do a combiner on the mapper side before shuffling, but that is it. Do
>> you mean windowing function in Spark SQL won't do any map side combiner,
>> even it is for max?
>>
>
> Windowing can't do partial aggregation and will have to collect all the
> data for a group so that it can be sorted before applying the function.  In
> contrast a max aggregation will do partial aggregation (map side combining)
> and can be calculated in a streaming fashion.
>
> Also, aggregation is more common and thus has seen more optimization
> beyond the theoretical limits described above.
>
>


Re: Anyone have a tutorial or guide to implement Spark + AWS + Caffe/CUDA?

2016-04-07 Thread jamborta
Hi Alfredo,

I have been building something similar and found that EMR is not suitable
for this, as the GPU instances don't come with NVIDIA drivers (and the
bootstrap process does not allow restarting instances).

The way I'm setting it up is based on the spark-ec2 script, where you can use
custom AMIs that contain your dockerized application plus all the
drivers and dependencies installed. I haven't completed my project yet, but I
haven't run into any major blockers so far.

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Anyone-have-a-tutorial-or-guide-to-implement-Spark-AWS-Caffe-CUDA-tp26705p26707.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is Hive CREATE DATABASE IF NOT EXISTS atomic

2016-04-07 Thread Xiao Li
Hi,

Assuming you are using 1.6 or earlier, this is a native Hive command.

Basically, the execution of the database creation is handled by Hive.

Thanks,

Xiao Li

2016-04-07 15:23 GMT-07:00 antoniosi :

> Hi,
>
> I am using hiveContext.sql("create database if not exists ") to
> create a hive db. Is this statement atomic?
>
> Thanks.
>
> Antonio.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-Hive-CREATE-DATABASE-IF-NOT-EXISTS-atomic-tp26706.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Is Hive CREATE DATABASE IF NOT EXISTS atomic

2016-04-07 Thread antoniosi
Hi,

I am using hiveContext.sql("create database if not exists ") to
create a hive db. Is this statement atomic?

Thanks.

Antonio.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-Hive-CREATE-DATABASE-IF-NOT-EXISTS-atomic-tp26706.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-07 Thread Ted Yu
Which Spark release are you using?

Have you registered for all the events provided by SparkListener?

If so, can you do an event-wise summation of the execution time?
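
For example, a minimal listener along these lines would sum the task
durations reported at each task-end event - a sketch against the 1.x
scheduler API, worth double-checking the field names against your release:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch: accumulate the wall-clock duration of every finished task.
class TaskTimeListener extends SparkListener {
  var totalTaskTimeMs: Long = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    totalTaskTimeMs += taskEnd.taskInfo.duration
  }
}

// val listener = new TaskTimeListener
// sc.addSparkListener(listener)
// ... run the job, then compare listener.totalTaskTimeMs with the app wall-clock time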

Thanks

On Thu, Apr 7, 2016 at 11:03 AM, JasmineGeorge  wrote:

> We are running a batch job with the following specifications
> •   Building RandomForest with config : maxbins=100, depth=19, num of
> trees =
> 20
> •   Multiple runs with different input data size 2.8 GB, 10 Million
> records
> •   We are running spark application on Yarn in cluster mode, with 3
> Node
> Managers(each with 16 virtual cores and 96G RAM)
> •   Spark config :
> o   spark.driver.cores = 2
> o   spark.driver.memory = 32 G
> o   spark.executor.instances = 5  and spark.executor.cores = 8 so 40
> cores in
> total.
> o   spark.executor.memory= 32G so total executor memory around 160 G.
>
> We are collecting execution times for the tasks using a SparkListener, and
> also the total execution time for the application from the Spark Web UI.
> Across all the tests we saw consistently that,  sum total of the execution
> times of all the tasks is accounting to about 60% of the total application
> run time.
> We are just kind of wondering where is the rest of the 40% of the time
> being
> spent.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Only-60-of-Total-Spark-Batch-Application-execution-time-spent-in-Task-Processing-tp26703.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Running Spark on Yarn-Client/Cluster mode

2016-04-07 Thread JasmineGeorge
The logs are self-explanatory.

It says "java.io.IOException: Incomplete HDFS URI, no host:
hdfs:/user/hduser/share/lib/spark-assembly.jar"


You need to specify the host in the above HDFS URI.
It should look something like the following:

hdfs://<namenode-host>:8020/user/hduser/share/lib/spark-assembly.jar



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-on-Yarn-Client-Cluster-mode-tp26691p26704.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-07 Thread JasmineGeorge
We are running a batch job with the following specifications
•   Building RandomForest with config: maxbins=100, depth=19, num of trees = 20
•   Multiple runs with different input data sizes: 2.8 GB, 10 million records
•   We are running the Spark application on YARN in cluster mode, with 3 Node
Managers (each with 16 virtual cores and 96 GB RAM)
•   Spark config:
o   spark.driver.cores = 2
o   spark.driver.memory = 32 G
o   spark.executor.instances = 5 and spark.executor.cores = 8, so 40 cores in total
o   spark.executor.memory = 32 G, so total executor memory around 160 G

We are collecting execution times for the tasks using a SparkListener, and
also the total execution time for the application from the Spark Web UI.
Across all the tests we saw consistently that the sum of the execution
times of all the tasks accounts for about 60% of the total application
run time.
We are just wondering where the remaining 40% of the time is being spent.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Only-60-of-Total-Spark-Batch-Application-execution-time-spent-in-Task-Processing-tp26703.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Working with zips in pyspark

2016-04-07 Thread tminima
I have n zips in a directory and I want to extract each one of them, then
get some data out of a file or two inside each zip and add it to a
graph DB. All of my zips are in an HDFS directory.

I am thinking my code should be along these lines:

# Names of all my zips
zip_names = ["a.zip", "b.zip", "c.zip"]

# extract_and_populate_graphdb() returns 1 after doing all the work.
# This was done so that a closure can be applied to start the spark job.
sc.parallelize(zip_names).map(extract_and_populate_graphdb).reduce(lambda a, b: a + b)

What I can't figure out is how to extract the zips and read the files inside
them. I am able to read all the zips, but I can't save the extracted files to
HDFS. Here is the code:

import io
import zipfile

def ze(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    return file_obj

a = sc.binaryFiles("hdfs:/Testing/*.zip")
a.map(ze).collect()

The above code returns me a list of zipfile.ZipFile objects.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Working-with-zips-in-pyspark-tp26701.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to remove empty strings from JavaRDD

2016-04-07 Thread Nirmal Manoharan
Hi Greg,
I use something similar to this in my application, but not for empty strings,
so the example below is not tested, but it should work.

JavaRDD<String> filteredJavaRDD = example.filter(new Function<String, Boolean>() {
    // keep only the non-empty strings
    public Boolean call(String arg0) throws Exception {
        return (!arg0.equals(""));
    }
});

Thanks!
Nirmal
On Thu, Apr 7, 2016 at 8:27 AM Chris Miller  wrote:

> flatmap?
>
>
> --
> Chris Miller
>
> On Thu, Apr 7, 2016 at 10:25 PM, greg huang  wrote:
>
>> Hi All,
>>
>>Can someone give me a example code to get rid of the empty string in
>> JavaRDD? I kwon there is a filter method in JavaRDD:
>> https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/RDD.html#filter(scala.Function1)
>>
>> Regards,
>>Greg
>>
>
>


RE: mapWithState not compacting removed state

2016-04-07 Thread Iain Cundy
Hi Ofir

I've discovered compaction works in 1.6.0 if I switch off Kryo.

I was using a workaround to get around mapWithState not supporting Kryo. See 
https://issues.apache.org/jira/browse/SPARK-12591

My custom KryoRegistrator Java class has

// workaround until bug fixes in spark 1.6.1
kryo.register(OpenHashMapBasedStateMap.class);
kryo.register(EmptyStateMap.class);
kryo.register(MapWithStateRDDRecord.class);

which certainly made the NullPointerException errors when checkpointing go
away, but (inexplicably to me) doesn't allow compaction to work.

I wonder whether the "proper" Kryo support fix in 1.6.1 enables compaction? Has 
anybody seen compaction working with the patch?

If you are relying on compaction you do need to be aware of the implementation 
semantics, which may explain why your state is growing for longer than you 
expect.

I sent those to the list recently, but I can repeat them to you if you can’t 
find them. I think
--conf spark.streaming.sessionByKey.deltaChainThreshold=2
is critical if you want compaction more often than once every 190 batches.

Your application will slow down dramatically if your state grows too large for 
your container memory, which you can monitor in the task data in 
ApplicationMaster web UI.

Cheers
Iain

From: Ofir Kerker [mailto:ofir.ker...@gmail.com]
Sent: 07 April 2016 17:06
To: Iain Cundy; user@spark.apache.org
Subject: Re: mapWithState not compacting removed state

Hi Iain,
Did you manage to solve this issue?
It looks like we have a similar issue with processing time increasing every 
micro-batch but only after 30 batches.

Thanks.

On Thu, Mar 3, 2016 at 4:45 PM Iain Cundy <iain.cu...@amdocs.com> wrote:
Hi All

I’m aggregating data using mapWithState with a timeout set in 1.6.0. It broadly 
works well and by providing access to the key and the time in the callback 
allows a much more elegant solution for time based aggregation than the old 
updateStateByKey function.

However there seems to be a problem – the size of the state and the time taken 
to iterate over it for each micro-batch keeps increasing over time, long after 
the number of ‘current’ keys settles down. We start removing keys after just 
over an hour, but the size of the state keeps increasing in runs of over 6 
hours.

Essentially we start by just adding keys for our input tuples, reaching a peak 
of about 7 million keys. Then we start to output data and remove keys – the 
number of keys drops to about 5 million. We continue processing tuples, which 
adds keys, while removing the keys we no longer need – the number of keys 
fluctuates up and down between 5 million and  8 million.

We know this, and are reasonably confident our removal of keys is correct, 
because we obtain the state with JavaMapWithStateDStream.stateSnapshots and 
count the keys.

From my reading (I don’t know scala!) of the code in 
org.apache.spark.streaming.util.StateMap.scala it seems clear that the removed 
keys are only marked as deleted and are really destroyed subsequently by 
compaction, based upon the length of the chain of delta maps. We’d expect the 
size of the state RDDs and the time taken to iterate over all the state to 
stabilize once compaction is run after we remove keys, but it just doesn’t 
happen.

Is there some possible reason why compaction never gets run?

I tried to use the (undocumented?) config setting 
spark.streaming.sessionByKey.deltaChainThreshold to try to control how often 
compaction is run with:
--conf spark.streaming.sessionByKey.deltaChainThreshold=2

I can see it in the Spark application UI Environment page, but it doesn’t seem 
to make any difference.

I have noticed that the timeout mechanism only gets invoked on every 10th 
micro-batch. I’m almost sure it isn’t a coincidence that the checkpoint 
interval is also 10 micro-batches. I assume that is an intentional performance 
optimization. However because I have a lot of keys, I have a large micro-batch 
duration, so it would make sense for me to reduce that factor of 10. However, 
since I don’t call checkpoint on the state stream I can’t see how to change it?

Can I change the checkpoint interval  somewhere? [I tried calling 
JavaMapWithStateDStream.checkpoint myself, but that evidently isn’t the same 
thing!]

My initial assumption was that there is a new deltaMap for each micro-batch, 
but having noticed the timeout behavior I wonder if there is only a new 
deltaMap for each checkpoint? Or maybe there are other criteria?

Perhaps compaction just hasn’t run before my application falls over? Can anyone 
clarify exactly when it should run?

Or maybe compaction doesn’t delete old removed keys for some reason?

Thank you for your attention.

Cheers
Iain Cundy

This message and the information contained herein is proprietary and 
confidential and subject to the Amdocs policy statement, you may review at 
http://www.amdocs.com/email_disclaimer.asp


Re: HashingTF "compatibility" across Python, Scala?

2016-04-07 Thread Nick Pentreath
You're right, Sean - the implementation currently depends on the runtime hash
code, so the values may differ. I opened a JIRA (which duplicated this one -
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10574
which is the active JIRA) for using MurmurHash3, which should then be
consistent across platforms and languages (as well as more performant).

It's also odd (legacy, I think) that the Python version has its own
implementation rather than calling into Java. That should probably also be
changed.
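
For illustration, a quick way to see the JVM side of the divergence - a
sketch; the index printed here will generally differ from what PySpark's
hash()-based implementation produces for the same term:

import org.apache.spark.mllib.feature.HashingTF

// Sketch: the Scala/Java implementation buckets a term by its JVM hashCode,
// while the 1.x Python implementation uses Python's built-in hash(),
// so the same term can land in a different bucket across languages.
val tf = new HashingTF(1 << 20)
println(tf.indexOf("spark"))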
On Thu, 7 Apr 2016 at 17:59, Sean Owen  wrote:

> Let's say I use HashingTF in my Pipeline to hash a string feature.
> This is available in Python and Scala, but they hash strings to
> different values since both use their respective runtime's native hash
> implementation. This means that I create different feature vectors for
> the same input. While I can load/store something like a
> NaiveBayesModel across the two languages successfully, it seems like
> the hashing part doesn't translate.
>
> Is that accurate, or, have I completely missed a way to get the same
> hashing for the same input across languages?
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: building kafka project on intellij Help is much appreciated

2016-04-07 Thread Ted Yu
This is the version of Kafka Spark depends on:

[INFO] +- org.apache.kafka:kafka_2.10:jar:0.8.2.1:compile
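
A minimal build.sbt along these lines should resolve - a sketch: it drops
the direct kafka dependency (there is no org.apache.kafka:kafka_2.10:1.6.0,
which is why it fails to resolve) and marks Spark as provided so the fat jar
stays small:

name := "twitter"

version := "1.0"

scalaVersion := "2.10.4"

// Spark itself is provided by the cluster at runtime, so keep it out of the fat jar.
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.0" % "provided"

// This transitively pulls in org.apache.kafka:kafka_2.10:0.8.2.1.
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0"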

On Thu, Apr 7, 2016 at 9:14 AM, Haroon Rasheed 
wrote:

> Try removing libraryDependencies += "org.apache.kafka" %% "kafka" % "1.6.0"
> compile. I guess the internal dependencies are automatically pulled when
> you add spark-streaming-kafka_2.10.
>
> Also try changing the version to 1.6.1 or lower. Just to see if the links
> are broken.
>
> Regards,
> Haroon Syed
>
> On 7 April 2016 at 09:08, Sudhanshu Janghel <
> sudhanshu.jang...@cloudwick.com> wrote:
>
>> Hello,
>>
>> I am new to building kafka and wish to understand how to make fat jars in
>> intellij.
>> The sbt assembly seems confusing and I am unable to resolve the
>> dependencies.
>>
>> here is my build.sbt
>>
>> name := "twitter"
>>
>> version := "1.0"
>> scalaVersion := "2.10.4"
>>
>>
>>
>> //libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.7" % "provided"
>> //libraryDependencies += "org.slf4j" % "slf4j-log4j12" % "1.7.7" % "provided"
>> //libraryDependencies += "com.google.guava" % "guava" % "11.0.2" 
>> exclude("log4j", "log4j") exclude("org.slf4j","slf4j-log4j12") 
>> exclude("org.slf4j","slf4j-api")
>> libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
>> libraryDependencies +=   "org.apache.kafka" %% "kafka"  % "1.6.0"
>> libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % 
>> "1.6.0"
>>
>>
>>
>> adn here is my assembly.sbt
>>
>> addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
>>
>>
>> the error faced is
>>
>> Error:Error while importing SBT project:...[info] Resolving 
>> org.scala-sbt#tracking;0.13.8 ...
>> [info] Resolving org.scala-sbt#cache;0.13.8 ...
>> [info] Resolving org.scala-sbt#testing;0.13.8 ...
>> [info] Resolving org.scala-sbt#test-agent;0.13.8 ...
>> [info] Resolving org.scala-sbt#test-interface;1.0 ...
>> [info] Resolving org.scala-sbt#main-settings;0.13.8 ...
>> [info] Resolving org.scala-sbt#apply-macro;0.13.8 ...
>> [info] Resolving org.scala-sbt#command;0.13.8 ...
>> [info] Resolving org.scala-sbt#logic;0.13.8 ...
>> [info] Resolving org.scala-sbt#precompiled-2_8_2;0.13.8 ...
>> [info] Resolving org.scala-sbt#precompiled-2_9_2;0.13.8 ...
>> [info] Resolving org.scala-sbt#precompiled-2_9_3;0.13.8 ...
>> [trace] Stack trace suppressed: run 'last *:update' for the full output.
>> [trace] Stack trace suppressed: run 'last *:ssExtractDependencies' for the 
>> full output.
>> [error] (*:update) sbt.ResolveException: unresolved dependency: 
>> com.eed3si9n#sbt-assembly;0.14.3: not found
>> [error] unresolved dependency: org.apache.kafka#kafka_2.10;1.6.0: not found
>> [error] (*:ssExtractDependencies) sbt.ResolveException: unresolved 
>> dependency: com.eed3si9n#sbt-assembly;0.14.3: not found
>> [error] unresolved dependency: org.apache.kafka#kafka_2.10;1.6.0: not found
>> [error] Total time: 18 s, completed Apr 7, 2016 5:05:06 PM
>>
>>
>>
>>
>


Re: building kafka project on intellij Help is much appreciated

2016-04-07 Thread Haroon Rasheed
Try removing the libraryDependencies += "org.apache.kafka" %% "kafka" %
"1.6.0" line and compiling again.
I guess the internal dependencies are automatically pulled in when you add
spark-streaming-kafka_2.10.

Also try changing the version to 1.6.1 or lower, just to see if the links
are broken.

Regards,
Haroon Syed

On 7 April 2016 at 09:08, Sudhanshu Janghel  wrote:

> Hello,
>
> I am new to building kafka and wish to understand how to make fat jars in
> intellij.
> The sbt assembly seems confusing and I am unable to resolve the
> dependencies.
>
> here is my build.sbt
>
> name := "twitter"
>
> version := "1.0"
> scalaVersion := "2.10.4"
>
>
>
> //libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.7" % "provided"
> //libraryDependencies += "org.slf4j" % "slf4j-log4j12" % "1.7.7" % "provided"
> //libraryDependencies += "com.google.guava" % "guava" % "11.0.2" 
> exclude("log4j", "log4j") exclude("org.slf4j","slf4j-log4j12") 
> exclude("org.slf4j","slf4j-api")
> libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
> libraryDependencies +=   "org.apache.kafka" %% "kafka"  % "1.6.0"
> libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % 
> "1.6.0"
>
>
>
> adn here is my assembly.sbt
>
> addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
>
>
> the error faced is
>
> Error:Error while importing SBT project:...[info] Resolving 
> org.scala-sbt#tracking;0.13.8 ...
> [info] Resolving org.scala-sbt#cache;0.13.8 ...
> [info] Resolving org.scala-sbt#testing;0.13.8 ...
> [info] Resolving org.scala-sbt#test-agent;0.13.8 ...
> [info] Resolving org.scala-sbt#test-interface;1.0 ...
> [info] Resolving org.scala-sbt#main-settings;0.13.8 ...
> [info] Resolving org.scala-sbt#apply-macro;0.13.8 ...
> [info] Resolving org.scala-sbt#command;0.13.8 ...
> [info] Resolving org.scala-sbt#logic;0.13.8 ...
> [info] Resolving org.scala-sbt#precompiled-2_8_2;0.13.8 ...
> [info] Resolving org.scala-sbt#precompiled-2_9_2;0.13.8 ...
> [info] Resolving org.scala-sbt#precompiled-2_9_3;0.13.8 ...
> [trace] Stack trace suppressed: run 'last *:update' for the full output.
> [trace] Stack trace suppressed: run 'last *:ssExtractDependencies' for the 
> full output.
> [error] (*:update) sbt.ResolveException: unresolved dependency: 
> com.eed3si9n#sbt-assembly;0.14.3: not found
> [error] unresolved dependency: org.apache.kafka#kafka_2.10;1.6.0: not found
> [error] (*:ssExtractDependencies) sbt.ResolveException: unresolved 
> dependency: com.eed3si9n#sbt-assembly;0.14.3: not found
> [error] unresolved dependency: org.apache.kafka#kafka_2.10;1.6.0: not found
> [error] Total time: 18 s, completed Apr 7, 2016 5:05:06 PM
>
>
>
>


building kafka project on intellij Help is much appreciated

2016-04-07 Thread Sudhanshu Janghel
Hello,

I am new to building Kafka applications and wish to understand how to make
fat jars in IntelliJ.
The sbt assembly setup seems confusing and I am unable to resolve the
dependencies.

here is my build.sbt

name := "twitter"

version := "1.0"
scalaVersion := "2.10.4"



//libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.7" % "provided"
//libraryDependencies += "org.slf4j" % "slf4j-log4j12" % "1.7.7" % "provided"
//libraryDependencies += "com.google.guava" % "guava" % "11.0.2"
exclude("log4j", "log4j") exclude("org.slf4j","slf4j-log4j12")
exclude("org.slf4j","slf4j-api")
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
libraryDependencies +=   "org.apache.kafka" %% "kafka"  % "1.6.0"
libraryDependencies += "org.apache.spark" %
"spark-streaming-kafka_2.10" % "1.6.0"



and here is my assembly.sbt

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")


the error faced is

Error:Error while importing SBT project:...[info]
Resolving org.scala-sbt#tracking;0.13.8 ...
[info] Resolving org.scala-sbt#cache;0.13.8 ...
[info] Resolving org.scala-sbt#testing;0.13.8 ...
[info] Resolving org.scala-sbt#test-agent;0.13.8 ...
[info] Resolving org.scala-sbt#test-interface;1.0 ...
[info] Resolving org.scala-sbt#main-settings;0.13.8 ...
[info] Resolving org.scala-sbt#apply-macro;0.13.8 ...
[info] Resolving org.scala-sbt#command;0.13.8 ...
[info] Resolving org.scala-sbt#logic;0.13.8 ...
[info] Resolving org.scala-sbt#precompiled-2_8_2;0.13.8 ...
[info] Resolving org.scala-sbt#precompiled-2_9_2;0.13.8 ...
[info] Resolving org.scala-sbt#precompiled-2_9_3;0.13.8 ...
[trace] Stack trace suppressed: run 'last *:update' for the full output.
[trace] Stack trace suppressed: run 'last *:ssExtractDependencies' for
the full output.
[error] (*:update) sbt.ResolveException: unresolved dependency:
com.eed3si9n#sbt-assembly;0.14.3: not found
[error] unresolved dependency: org.apache.kafka#kafka_2.10;1.6.0: not found
[error] (*:ssExtractDependencies) sbt.ResolveException: unresolved
dependency: com.eed3si9n#sbt-assembly;0.14.3: not found
[error] unresolved dependency: org.apache.kafka#kafka_2.10;1.6.0: not found
[error] Total time: 18 s, completed Apr 7, 2016 5:05:06 PM


Re: Dataframe to parquet using hdfs or parquet block size

2016-04-07 Thread Buntu Dev
I tried setting both the HDFS and Parquet block sizes, but writing to Parquet
did not seem to have any effect on the total number of blocks or the
average block size. Here is what I did:


sqlContext.setConf("dfs.blocksize", "134217728")
sqlContext.setConf("parquet.block.size", "134217728")
sqlContext.setConf("spark.broadcast.blockSize", "134217728")
df.write.parquet("/path/to/dest")


I tried the same with different block sizes, but none had any effect. Is
this the right way to set these properties, using setConf()?
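
One variant I have seen suggested - a sketch, not verified here - sets these
on the Hadoop configuration that the Parquet writer picks up, rather than on
the SQL conf:

// Sketch: Parquet and HDFS block sizes are Hadoop-level settings, so set them
// on the SparkContext's Hadoop configuration before writing.
sc.hadoopConfiguration.setInt("dfs.blocksize", 134217728)
sc.hadoopConfiguration.setInt("parquet.block.size", 134217728)
df.write.parquet("/path/to/dest")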


Thanks!


On Wed, Apr 6, 2016 at 11:28 PM, bdev  wrote:

> I need to save the dataframe to parquet format and need some input on
> choosing the appropriate block size to help efficiently
> parallelize/localize
> the data to the executors. Should I be using parquet block size or hdfs
> block size and what is the optimal block size to use on a 100 node cluster?
>
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-to-parquet-using-hdfs-or-parquet-block-size-tp26693.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: mapWithState not compacting removed state

2016-04-07 Thread Ofir Kerker
Hi Iain,
Did you manage to solve this issue?
It looks like we have a similar issue with processing time increasing every
micro-batch but only after 30 batches.

Thanks.

On Thu, Mar 3, 2016 at 4:45 PM Iain Cundy  wrote:

> Hi All
>
>
>
> I’m aggregating data using mapWithState with a timeout set in 1.6.0. It
> broadly works well and by providing access to the key and the time in the
> callback allows a much more elegant solution for time based aggregation
> than the old updateStateByKey function.
>
>
>
> However there seems to be a problem – the size of the state and the time
> taken to iterate over it for each micro-batch keeps increasing over time,
> long after the number of ‘current’ keys settles down. We start removing
> keys after just over an hour, but the size of the state keeps increasing in
> runs of over 6 hours.
>
>
>
> Essentially we start by just adding keys for our input tuples, reaching a
> peak of about 7 million keys. Then we start to output data and remove keys
> – the number of keys drops to about 5 million. We continue processing
> tuples, which adds keys, while removing the keys we no longer need – the
> number of keys fluctuates up and down between 5 million and  8 million.
>
>
>
> We know this, and are reasonably confident our removal of keys is correct,
> because we obtain the state with JavaMapWithStateDStream.stateSnapshots and
> count the keys.
>
>
>
> From my reading (I don’t know scala!) of the code in
> org.apache.spark.streaming.util.StateMap.scala it seems clear that the
> removed keys are only marked as deleted and are really destroyed
> subsequently by compaction, based upon the length of the chain of delta
> maps. We’d expect the size of the state RDDs and the time taken to iterate
> over all the state to stabilize once compaction is run after we remove
> keys, but it just doesn’t happen.
>
>
>
> Is there some possible reason why compaction never gets run?
>
>
>
> I tried to use the (undocumented?) config setting
> spark.streaming.sessionByKey.deltaChainThreshold to try to control how
> often compaction is run with:
>
> --conf spark.streaming.sessionByKey.deltaChainThreshold=2
>
>
>
> I can see it in the Spark application UI Environment page, but it doesn’t
> seem to make any difference.
>
>
>
> I have noticed that the timeout mechanism only gets invoked on every 10th
> micro-batch. I’m almost sure it isn’t a coincidence that the checkpoint
> interval is also 10 micro-batches. I assume that is an intentional
> performance optimization. However because I have a lot of keys, I have a
> large micro-batch duration, so it would make sense for me to reduce that
> factor of 10. However, since I don’t call checkpoint on the state stream I
> can’t see how to change it?
>
>
>
> Can I change the checkpoint interval  somewhere? [I tried calling
> JavaMapWithStateDStream.checkpoint myself, but that evidently isn’t the
> same thing!]
>
>
>
> My initial assumption was that there is a new deltaMap for each
> micro-batch, but having noticed the timeout behavior I wonder if there is
> only a new deltaMap for each checkpoint? Or maybe there are other criteria?
>
>
>
> Perhaps compaction just hasn’t run before my application falls over? Can
> anyone clarify exactly when it should run?
>
>
>
> Or maybe compaction doesn’t delete old removed keys for some reason?
>
>
>
> Thank you for your attention.
>
>
>
> Cheers
>
> Iain Cundy
>
>
> This message and the information contained herein is proprietary and
> confidential and subject to the Amdocs policy statement, you may review at
> http://www.amdocs.com/email_disclaimer.asp
>


HashingTF "compatibility" across Python, Scala?

2016-04-07 Thread Sean Owen
Let's say I use HashingTF in my Pipeline to hash a string feature.
This is available in Python and Scala, but they hash strings to
different values since both use their respective runtime's native hash
implementation. This means that I create different feature vectors for
the same input. While I can load/store something like a
NaiveBayesModel across the two languages successfully, it seems like
the hashing part doesn't translate.

Is that accurate, or, have I completely missed a way to get the same
hashing for the same input across languages?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark on Mobile platforms

2016-04-07 Thread Luciano Resende
Take a look at Apache Quarks; it is more along the lines of what you are
looking for and has the ability to integrate with Spark.

http://quarks.apache.org/

On Thu, Apr 7, 2016 at 4:50 AM, sakilQUB  wrote:

> Hi all,
>
> I have been trying to find if Spark can be run on a mobile device platform
> (Android preferably) to analyse mobile log data for some performance
> analysis. So, basically the idea is to collect and process the mobile log
> data within the mobile device using the Spark framework to allow real-time
> log data analysis, without offloading the data to remote server/cloud.
>
> Does anybody have any idea about whether running Spark on a mobile platform
> is supported by the existing Spark framework or is there any other option
> for this?
>
> Many thanks.
> Sakil
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Mobile-platforms-tp26699.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: How to process one partition at a time?

2016-04-07 Thread Andrei
Thanks everyone - both `submitJob` and `PartitionPruningRDD` work for
me.
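
In case it helps anyone searching the archive later, the PartitionPruningRDD
variant ends up looking roughly like this - a sketch, where rdd stands in for
the real input and targetPartition is the index to keep:

import org.apache.spark.rdd.PartitionPruningRDD

// Sketch: wrap the original RDD so tasks are created only for the kept partition.
val targetPartition = 0  // placeholder index
val pruned = PartitionPruningRDD.create(rdd, partitionIndex => partitionIndex == targetPartition)
val sample = pruned.mapPartitions(iter => iter.take(10)).collect()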

On Thu, Apr 7, 2016 at 8:22 AM, Hemant Bhanawat 
wrote:

> Apparently, there is another way to do it. You can try creating a
> PartitionPruningRDD and pass a partition filter function to it. This RDD
> will do the same thing that I suggested in my mail and you will not have to
> create a new RDD.
>
> Hemant Bhanawat 
> www.snappydata.io
>
> On Wed, Apr 6, 2016 at 5:35 PM, Sun, Rui  wrote:
>
>> Maybe you can try SparkContext.submitJob:
>>
>> def submitJob[T, U, R](rdd: RDD[T], processPartition: (Iterator[T]) ⇒ U,
>> partitions: Seq[Int], resultHandler: (Int, U) ⇒ Unit,
>> resultFunc: ⇒ R): SimpleFutureAction[R]
>>
>>
>>
>>
>>
>> *From:* Hemant Bhanawat [mailto:hemant9...@gmail.com]
>> *Sent:* Wednesday, April 6, 2016 7:16 PM
>> *To:* Andrei 
>> *Cc:* user 
>> *Subject:* Re: How to process one partition at a time?
>>
>>
>>
>> Instead of doing it in compute, you could rather override getPartitions
>> method of your RDD and return only the target partitions. This way tasks
>> for only target partitions will be created. Currently in your case, tasks
>> for all the partitions are getting created.
>>
>> I hope it helps. I would like to hear if you take some other approach.
>>
>>
>> Hemant Bhanawat 
>>
>> www.snappydata.io
>>
>>
>>
>> On Wed, Apr 6, 2016 at 3:49 PM, Andrei  wrote:
>>
>> I'm writing a kind of sampler which in most cases will require only 1
>> partition, sometimes 2 and very rarely more. So it doesn't make sense to
>> process all partitions in parallel. What is the easiest way to limit
>> computations to one partition only?
>>
>>
>>
>> So far the best idea I came to is to create a custom partition whose
>> `compute` method looks something like:
>>
>>
>>
>> def compute(split: Partition, context: TaskContext) = {
>>
>> if (split.index == targetPartition) {
>>
>> // do computation
>>
>> } else {
>>
>>// return empty iterator
>>
>> }
>>
>> }
>>
>>
>>
>>
>>
>> But it's quite ugly and I'm unlikely to be the first person with such a
>> need. Is there easier way to do it?
>>
>>
>>
>>
>>
>>
>>
>
>


Re: Spark on Mobile platforms

2016-04-07 Thread Michael Slavitch
You should consider mobile agents that feed data into a spark datacenter via 
spark streaming.


> On Apr 7, 2016, at 8:28 AM, Ashic Mahtab  wrote:
> 
> Spark may not be the right tool for this. Working on just the mobile device, 
> you won't be scaling out stuff, and as such most of the benefits of Spark 
> would be nullified. Moreover, it'd likely run slower than things that are 
> meant to work in a single process. Spark is also quite large, which is 
> another drawback in terms of mobile apps.
> 
> Perhaps check out Tensorflow, which may be better suited for this particular 
> requirement.
> 
> -Ashic.
> 
> > Date: Thu, 7 Apr 2016 04:50:18 -0700
> > From: sbarbhuiy...@qub.ac.uk 
> > To: user@spark.apache.org 
> > Subject: Spark on Mobile platforms
> > 
> > Hi all,
> > 
> > I have been trying to find if Spark can be run on a mobile device platform
> > (Android preferably) to analyse mobile log data for some performance
> > analysis. So, basically the idea is to collect and process the mobile log
> > data within the mobile device using the Spark framework to allow real-time
> > log data analysis, without offloading the data to remote server/cloud.
> > 
> > Does anybody have any idea about whether running Spark on a mobile platform
> > is supported by the existing Spark framework or is there any other option
> > for this?
> > 
> > Many thanks.
> > Sakil
> > 
> > 
> > 
> > --
> > View this message in context: 
> > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Mobile-platforms-tp26699.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> > 
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> > 



Re: How to remove empty strings from JavaRDD

2016-04-07 Thread Chris Miller
flatmap?


--
Chris Miller

On Thu, Apr 7, 2016 at 10:25 PM, greg huang  wrote:

> Hi All,
>
>Can someone give me a example code to get rid of the empty string in
> JavaRDD? I kwon there is a filter method in JavaRDD:
> https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/RDD.html#filter(scala.Function1)
>
> Regards,
>Greg
>


How to remove empty strings from JavaRDD

2016-04-07 Thread greg huang
Hi All,

   Can someone give me example code to get rid of empty strings in a
JavaRDD? I know there is a filter method in JavaRDD:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/RDD.html#filter(scala.Function1)

Regards,
   Greg


RE: Spark on Mobile platforms

2016-04-07 Thread Ashic Mahtab
Spark may not be the right tool for this. Working on just the mobile device, 
you won't be scaling out stuff, and as such most of the benefits of Spark would 
be nullified. Moreover, it'd likely run slower than things that are meant to 
work in a single process. Spark is also quite large, which is another drawback 
in terms of mobile apps.
Perhaps check out Tensorflow, which may be better suited for this particular 
requirement.
-Ashic.

> Date: Thu, 7 Apr 2016 04:50:18 -0700
> From: sbarbhuiy...@qub.ac.uk
> To: user@spark.apache.org
> Subject: Spark on Mobile platforms
> 
> Hi all,
> 
> I have been trying to find if Spark can be run on a mobile device platform
> (Android preferably) to analyse mobile log data for some performance
> analysis. So, basically the idea is to collect and process the mobile log
> data within the mobile device using the Spark framework to allow real-time
> log data analysis, without offloading the data to remote server/cloud.
> 
> Does anybody have any idea about whether running Spark on a mobile platform
> is supported by the existing Spark framework or is there any other option
> for this?
> 
> Many thanks.
> Sakil
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Mobile-platforms-tp26699.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
  

difference between simple streaming and windows streaming in spark

2016-04-07 Thread Ashok Kumar
Does simple streaming mean continuous (per-batch) streaming, and does windowed
streaming mean processing over a time window?
val ssc = new StreamingContext(sparkConf, Seconds(10))
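
For concreteness, a sketch of the difference - the socket source and the
durations below are placeholders:

// Sketch: a plain DStream operation runs once per 10-second batch ("continuous"),
// while window() groups several consecutive batches into one sliding time window.
val lines = ssc.socketTextStream("localhost", 9999)                  // placeholder source
val perBatchCounts = lines.count()                                   // one result per 10s batch
val windowedCounts = lines.window(Seconds(60), Seconds(20)).count()  // 60s window, sliding every 20s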
thanks

Spark on Mobile platforms

2016-04-07 Thread sakilQUB
Hi all,

I have been trying to find if Spark can be run on a mobile device platform
(Android preferably) to analyse mobile log data for some performance
analysis. So, basically the idea is to collect and process the mobile log
data within the mobile device using the Spark framework to allow real-time
log data analysis, without offloading the data to remote server/cloud.

Does anybody have any idea about whether running Spark on a mobile platform
is supported by the existing Spark framework or is there any other option
for this?

Many thanks.
Sakil



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Mobile-platforms-tp26699.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: partition an empty RDD

2016-04-07 Thread Tenghuan He
Thanks for your response, Owen :)
Yes, I defined K with a ClassTag and it works.
Sorry for bothering you.

On Thu, Apr 7, 2016 at 4:07 PM, Sean Owen  wrote:

> It means pretty much what it says. Your code does not have runtime
> class info about K at this point in your code, and it is required.
>
> On Thu, Apr 7, 2016 at 5:52 AM, Tenghuan He  wrote:
> > Hi all,
> >
> > I want to create an empty rdd and partition it
> >
> > val  buffer: RDD[(K, (V, Int))] = base.context.emptyRDD[(K, (V,
> > Int))].partitionBy(new HashPartitioner(5))
> > but got Error: No ClassTag available for K
> >
> > scala needs at runtime to have information about K , but how to solve
> this?
> >
> > Thanks in advance.
> >
> > Tenghuan
>


Research issues in Spark, Spark Streamming and MLlib

2016-04-07 Thread C.H
Hello,
I have a research project in which I will be working on Spark, either building
something on top of it or trying to improve it.
I would like to know what research issues still exist in Spark Streaming,
MLlib, or Spark core, in order to improve them.

Thanks in advance



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Research-issues-in-Spark-Spark-Streamming-and-MLlib-tp26698.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Develop locally with Yarn

2016-04-07 Thread Natu Lauchande
Hi,

I am working on a Spark Streaming app; when running locally I use "local[*]"
as the master of my StreamingContext.

I wonder what would be needed to develop locally and run it on YARN through
the IDE I am using, IntelliJ IDEA.

Thanks,
Natu


Re: LabeledPoint with features in matrix form (word2vec matrix)

2016-04-07 Thread jamborta
It depends. If you'd like to multiply matrices for each row in the data, then
you could use a Breeze matrix and do that locally on the nodes, in a map or
similar.

If you'd like to multiply them across the rows, e.g. a row in your data is a
row in the matrix, then you could use a distributed matrix like
IndexedRowMatrix
(http://spark.apache.org/docs/latest/mllib-data-types.html#indexedrowmatrix).
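
For the second case, a minimal sketch (the tiny 2x2 sizes here are just
placeholders for the real word2vec dimensions):

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Sketch: each data row becomes a row of the distributed matrix, which is then
// multiplied by a small local matrix (e.g. the word2vec weights).
val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0))))
val mat = new IndexedRowMatrix(rows)
val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
val product = mat.multiply(local)   // another IndexedRowMatrix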



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/LabeledPoint-with-features-in-matrix-form-word2vec-matrix-tp26629p26696.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: partition an empty RDD

2016-04-07 Thread Sean Owen
It means pretty much what it says. Your code does not have runtime
class info about K at this point in your code, and it is required.
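
A minimal sketch of the fix, assuming K and V are type parameters in scope:
give them ClassTag context bounds so the runtime class info is available
where the RDD is built.

import scala.reflect.ClassTag
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.rdd.RDD

// Sketch: the context bounds supply the runtime class info that emptyRDD and
// the pair-RDD conversion behind partitionBy need.
def emptyBuffer[K: ClassTag, V: ClassTag](sc: SparkContext): RDD[(K, (V, Int))] =
  sc.emptyRDD[(K, (V, Int))].partitionBy(new HashPartitioner(5))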

On Thu, Apr 7, 2016 at 5:52 AM, Tenghuan He  wrote:
> Hi all,
>
> I want to create an empty rdd and partition it
>
> val  buffer: RDD[(K, (V, Int))] = base.context.emptyRDD[(K, (V,
> Int))].partitionBy(new HashPartitioner(5))
> but got Error: No ClassTag available for K
>
> scala needs at runtime to have information about K , but how to solve this?
>
> Thanks in advance.
>
> Tenghuan

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org