Re: PMML version in MLLib

2015-11-08 Thread Vincenzo Selvaggio
Hi,

I confirm the models are exported for PMML version 4.2; in fact you can see
it in the generated xml:
<PMML xmlns="http://www.dmg.org/PMML-4_2">
This is the default version when using
https://github.com/jpmml/jpmml-model/tree/1.1.X.

I didn't realize the version attribute of the PMML root element was
required; this can be easily added.
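
For reference, the change is essentially a one-liner against the org.dmg.pmml
classes; a minimal sketch (everything around setVersion is illustrative):

import org.dmg.pmml.{Header, PMML}

// Sketch only: build a PMML document and set the required version attribute
// on the root element; setVersion is the point here.
val pmml = new PMML()
pmml.setHeader(new Header())
pmml.setVersion("4.2")  // serialized as <PMML version="4.2" ...>
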
Please, as Owen suggested, add a PR and link it to
https://issues.apache.org/jira/browse/SPARK-8545.

Thanks,
Vincenzo

On Wed, Nov 4, 2015 at 12:14 PM, Fazlan Nazeem  wrote:

> Thanks Owen. Will do it
>
> On Wed, Nov 4, 2015 at 5:22 PM, Sean Owen  wrote:
>
>> I'm pretty sure that attribute is required. I am not sure what PMML
>> version the code has been written for but would assume 4.2.1. Feel
>> free to open a PR to add this version to all the output.
>>
>> On Wed, Nov 4, 2015 at 11:42 AM, Fazlan Nazeem  wrote:
>> > [adding dev]
>> >
>> > On Wed, Nov 4, 2015 at 2:27 PM, Fazlan Nazeem  wrote:
>> >>
>> >> I just went through all specifications, and they expect the version
>> >> attribute. This should be addressed very soon because if we cannot use
>> the
>> >> PMML model without the version attribute, there is no use of
>> generating one
>> >> without it.
>> >>
>> >> On Wed, Nov 4, 2015 at 2:17 PM, Stefano Baghino
>> >>  wrote:
>> >>>
>> >>> I used KNIME, which internally uses the org.dmg.pmml library.
>> >>>
>> >>> On Wed, Nov 4, 2015 at 9:45 AM, Fazlan Nazeem 
>> wrote:
>> 
>>  Hi Stefano,
>> 
>>  Although the intention for my question wasn't as you expected, what
>> you
>>  say makes sense. The standard[1] for PMML 4.1 specifies that "For
>> PMML 4.1
>>  the attribute version must have the value 4.1". I'm not sure whether
>> that
>>  means that other PMML versions do not need that attribute to be set
>>  explicitly. I hope someone would answer this.
>> 
>>  What was the tool you used to load the PMML?
>> 
>>  [1] http://dmg.org/pmml/v4-1/GeneralStructure.html
>> 
>>
>
>
>
> --
> Thanks & Regards,
>
> Fazlan Nazeem
>
> *Software Engineer*
>
> *WSO2 Inc*
> Mobile : +94772338839
> fazl...@wso2.com
>


Re: Spark Streaming updateStateByKey Implementation

2015-11-08 Thread Zoltán Zvara
It is implemented with cogroup. Basically it stores the states in a separate
RDD and cogroups the target RDD with the state RDD, which is then hidden
from you. See StateDStream.scala; everything you need to know is there.
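
A rough sketch of the cogroup step (simplified types, not the actual
StateDStream code):

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

// Sketch only: prevStateRDD holds (key, state) from the previous batch,
// batchRDD holds the (key, value) records of the current batch.
def updateState[K: ClassTag, V: ClassTag, S: ClassTag](
    prevStateRDD: RDD[(K, S)],
    batchRDD: RDD[(K, V)],
    updateFunc: (Seq[V], Option[S]) => Option[S]): RDD[(K, S)] = {
  prevStateRDD.cogroup(batchRDD).flatMapValues { case (states, values) =>
    // at most one previous state per key; None when the key is new in this batch
    updateFunc(values.toSeq, states.headOption)
  }
}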

On Fri, Nov 6, 2015 at 6:25 PM Hien Luu  wrote:

> Hi,
>
> I am interested in learning about the implementation of updateStateByKey.
> Does anyone know of a jira or design doc I can read?
>
> I did a quick search and couldn't find much info on the implementation.
>
> Thanks in advance,
>
> Hien
>


Re: streaming: missing data. does saveAsTextFile() append or replace?

2015-11-08 Thread Gerard Maas
Andy,

Using rdd.saveAsTextFile(...) will overwrite the data if your target
is the same path.

If you want to save to HDFS, DStream offers dstream.saveAsTextFiles(prefix,
suffix), where a new file will be written at each streaming interval.
Note that this will result in a saved file for each streaming interval. If
you want to increase the file size (usually a good idea in HDFS), you can
use a window function over the dstream and save the 'windowed' dstream
instead, as in the sketch below.
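
A minimal sketch of that (paths, durations and the input source are
illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object SaveStreamToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("SaveStreamToHdfs"), Seconds(5))

    // Illustrative source; any DStream works the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // One output directory per 5-second batch: prefix-<timestamp>.txt
    lines.saveAsTextFiles("hdfs:///rawStreamingData/batch", "txt")

    // Fewer, larger files: write once per 5-minute window instead.
    lines.window(Minutes(5), Minutes(5))
      .saveAsTextFiles("hdfs:///rawStreamingData/window", "txt")

    ssc.start()
    ssc.awaitTermination()
  }
}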

kind regards, Gerard.

On Sat, Nov 7, 2015 at 10:55 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi
>
> I just started a new spark streaming project. In this phase of the system
> all we want to do is save the data we receive to hdfs. After running for
> a couple of days it looks like I am missing a lot of data. I wonder if
> saveAsTextFile("hdfs:///rawSteamingData"); is overwriting the data I
> captured in previous windows? I noticed that after running for a couple of
> days my hdfs file system has 25 files. The names are something like
> "part-6". I used 'hadoop fs -dus' to check the total data captured. While
> the system was running I would periodically call 'dus' and I was surprised
> that sometimes the number of total bytes actually dropped.
>
>
> Is there a better way to write my data to disk?
>
> Any suggestions would be appreciated
>
> Andy
>
>
>public static void main(String[] args) {
>
>SparkConf conf = new SparkConf().setAppName(appName);
>
> JavaSparkContext jsc = new JavaSparkContext(conf);
>
> JavaStreamingContext ssc = new JavaStreamingContext(jsc, new
> Duration(5 * 1000));
>
>
> [ deleted code …]
>
>
> data.foreachRDD(new Function<JavaRDD<String>, Void>() {
>
> private static final long serialVersionUID =
> -7957854392903581284L;
>
>
> @Override
>
> public Void call(JavaRDD<String> jsonStr) throws Exception {
>
> jsonStr.saveAsTextFile("hdfs:///rawSteamingData"); //
> /rawSteamingData is a directory
>
> return null;
>
> }
>
> });
>
>
>
> ssc.checkpoint(checkPointUri);
>
>
>
> ssc.start();
>
> ssc.awaitTermination();
>
> }
>


Re: Spark Job failing with exit status 15

2015-11-08 Thread Ted Yu
Which release of Spark were you using ?

Can you post the command you used to run WordCount ?

Cheers

On Sat, Nov 7, 2015 at 7:59 AM, Shashi Vishwakarma  wrote:

> I am trying to run simple word count job in spark but I am getting
> exception while running job.
>
> For more detailed output, check application tracking 
> page:http://quickstart.cloudera:8088/proxy/application_1446699275562_0006/Then,
>  click on links to logs of each attempt.Diagnostics: Exception from 
> container-launch.Container id: container_1446699275562_0006_02_01Exit 
> code: 15Stack trace: ExitCodeException exitCode=15:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Container exited with a non-zero exit code 15Failing this attempt. Failing 
> the application.
>  ApplicationMaster host: N/A
>  ApplicationMaster RPC port: -1
>  queue: root.cloudera
>  start time: 1446910483956
>  final status: FAILED
>  tracking URL: 
> http://quickstart.cloudera:8088/cluster/app/application_1446699275562_0006
>  user: clouderaException in thread "main" 
> org.apache.spark.SparkException: Application finished with failed status
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:626)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:651)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> I checked log from following command
>
> yarn logs -applicationId application_1446699275562_0006
>
> Here is log
>
>  15/11/07 07:35:09 ERROR yarn.ApplicationMaster: User class threw exception: 
> Output directory 
> hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
> hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists
> at 
> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1053)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:954)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:863)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1290)
> at org.com.td.sparkdemo.spark.WordCount$.main(WordCount.scala:23)
> at org.com.td.sparkdemo.spark.WordCount.main(WordCount.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)15/11/07
>  07:35:09 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 
> 15, (reason: User class threw exception: Output directory 
> hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already 
> exists)15/11/07 07:35:14 ERROR yarn.ApplicationMaster: SparkContext did not 
> initialize after waiting for 10 ms. Please check earlier log output for 
> errors. Failing the application.
>
> Exception clearly 

Connecting SparkR through Yarn

2015-11-08 Thread Amit Behera
Hi All,

Spark Version = 1.5.1
Hadoop Version = 2.6.0

I set up the cluster on Amazon EC2 machines (1+5).
I am able to create a SparkContext object using the *init* method from *RStudio.*

But I do not know how I can create a SparkContext object in *yarn mode.*

I found the link below for running on yarn, but that document covers *Spark
versions >= 0.9.0 and <= 1.2.*


*https://github.com/amplab-extras/SparkR-pkg/blob/master/README.md#running-on-yarn
*


Please help me with how I can connect SparkR to Yarn.



Thanks,
Amit.


Re: Spark Job failing with exit status 15

2015-11-08 Thread Shashi Vishwakarma
Hi

I am using Spark 1.3.0. The command that I use is below.

/spark-submit --class org.com.td.sparkdemo.spark.WordCount \
--master yarn-cluster \
target/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar

Thanks
Shashi

On Sun, Nov 8, 2015 at 11:33 PM, Ted Yu  wrote:

> Which release of Spark were you using ?
>
> Can you post the command you used to run WordCount ?
>
> Cheers
>
> On Sat, Nov 7, 2015 at 7:59 AM, Shashi Vishwakarma <
> shashi.vish...@gmail.com> wrote:
>
>> I am trying to run simple word count job in spark but I am getting
>> exception while running job.
>>
>> For more detailed output, check application tracking 
>> page:http://quickstart.cloudera:8088/proxy/application_1446699275562_0006/Then,
>>  click on links to logs of each attempt.Diagnostics: Exception from 
>> container-launch.Container id: container_1446699275562_0006_02_01Exit 
>> code: 15Stack trace: ExitCodeException exitCode=15:
>> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>> at org.apache.hadoop.util.Shell.run(Shell.java:455)
>> at 
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>> at 
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>> at 
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>> at 
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Container exited with a non-zero exit code 15Failing this attempt. Failing 
>> the application.
>>  ApplicationMaster host: N/A
>>  ApplicationMaster RPC port: -1
>>  queue: root.cloudera
>>  start time: 1446910483956
>>  final status: FAILED
>>  tracking URL: 
>> http://quickstart.cloudera:8088/cluster/app/application_1446699275562_0006
>>  user: clouderaException in thread "main" 
>> org.apache.spark.SparkException: Application finished with failed status
>> at org.apache.spark.deploy.yarn.Client.run(Client.scala:626)
>> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:651)
>> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at 
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
>> at 
>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> I checked log from following command
>>
>> yarn logs -applicationId application_1446699275562_0006
>>
>> Here is log
>>
>>  15/11/07 07:35:09 ERROR yarn.ApplicationMaster: User class threw exception: 
>> Output directory 
>> hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists
>> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
>> hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists
>> at 
>> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
>> at 
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1053)
>> at 
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:954)
>> at 
>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:863)
>> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1290)
>> at org.com.td.sparkdemo.spark.WordCount$.main(WordCount.scala:23)
>> at org.com.td.sparkdemo.spark.WordCount.main(WordCount.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at 
>> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)15/11/07
>>  07:35:09 INFO 

Broadcast Variables not showing inside Partitions Apache Spark

2015-11-08 Thread prajwol sangat
Hi All,

I am facing a weird situation which is explained below.

Scenario and Problem: I want to add two attributes to a JSON object based on the
lookup table values and insert the JSON into Mongo DB. I have a broadcast variable
which holds the lookup table. However, I am not able to access it inside
foreachPartition, as you can see in the code. It does not give me any error but
simply does not display anything. Also, because of this I can't insert the JSON
into Mongo DB. I can't find any explanation for this behaviour. Any explanation or
workaround to make it work is much appreciated.

Note: I am using Spark Streaming and the files arrive as micro batches in HDFS. I
want to be able to use more cores in Spark Streaming (I have 16 cores
available) so that processing can be done faster.

Thanks.

Regards,
Prajwol

Here is my full code:

object ProcessMicroBatchStreams {
  val calculateDistance = udf { (lat: String, lon: String) =>
    GeoHash.getDistance(lat.toDouble, lon.toDouble)
  }
  val DB_NAME = "IRT"
  val COLLECTION_NAME = "sensordata"
  val records = Array[String]()

  def main(args: Array[String]): Unit = {
    if (args.length < 0) {
      System.err.println("Usage: ProcessMicroBatchStreams  ")
      System.exit(1)
    }
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName(this.getClass.getCanonicalName)
  .set("spark.hadoop.validateOutputSpecs", "false")
/*.set("spark.executor.instances", "3")
.set("spark.executor.memory", "18g")
.set("spark.executor.cores", "9")
.set("spark.task.cpus", "1")
.set("spark.driver.memory", "10g")*/

val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(60))
val sqc = new SQLContext(sc)
val gpsLookUpTable = MapInput.cacheMappingTables(sc, 
sqc).persist(StorageLevel.MEMORY_AND_DISK_SER_2)
val broadcastTable = sc.broadcast(gpsLookUpTable)


ssc.textFileStream("hdfs://localhost:9000/inputDirectory/")
  .foreachRDD { rdd =>
  //broadcastTable.value.show() // I can access broadcast value here
  if (!rdd.partitions.isEmpty) {
val partitionedRDD = rdd.repartition(4)
partitionedRDD.foreachPartition {
  partition =>
println("Inside Partition")
broadcastTable.value.show() // I cannot access broadcast value here
partition.foreach {
  row =>
val items = row.split("\n")
items.foreach { item =>
  val mongoColl = MongoClient()(DB_NAME)(COLLECTION_NAME)
  val jsonObject = new JSONObject(item)
  val latitude = jsonObject.getDouble(Constants.LATITUDE)
  val longitude = jsonObject.getDouble(Constants.LONGITUDE)

  // The broadcast value is not being shown here
  // However, there is no error shown
  // I cannot insert the value into Mongo DB
  val selectedRow = broadcastTable.value
.filter("geoCode LIKE '" + GeoHash.subString(latitude, 
longitude) + "%'")
.withColumn("Distance", calculateDistance(col("Lat"), 
col("Lon")))
.orderBy("Distance")
.select(Constants.TRACK_KM, Constants.TRACK_NAME).take(1)
  if (selectedRow.length != 0) {
jsonObject.put(Constants.TRACK_KM, selectedRow(0).get(0))
jsonObject.put(Constants.TRACK_NAME, selectedRow(0).get(1))
  }
  else {
jsonObject.put(Constants.TRACK_KM, "NULL")
jsonObject.put(Constants.TRACK_NAME, "NULL")
  }
  val record = 
JSON.parse(jsonObject.toString()).asInstanceOf[DBObject]
  mongoColl.insert(record)
}
}
}
  }
}
sys.addShutdownHook {
  ssc.stop(true, true)
}

ssc.start()
ssc.awaitTermination()
}
}
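
One commonly used workaround in this situation (a sketch, with illustrative
paths and column names): broadcast the collected rows rather than the DataFrame
itself. A DataFrame is a driver-side handle (running DataFrame operations from
inside a task on an executor isn't supported), so the .value.show() inside
foreachPartition can't work the way it does on the driver.

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the lookup source, input path and column names are illustrative.
object BroadcastLookupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastLookupSketch"))
    val sqc = new SQLContext(sc)

    // Collect the (small) lookup table to the driver and broadcast plain rows,
    // not the DataFrame: executors can use the local Array, but not a DataFrame.
    val gpsLookUpTable = sqc.read.parquet("hdfs:///lookup/gps")
    val localRows: Array[Row] = gpsLookUpTable.select("geoCode", "Lat", "Lon").collect()
    val broadcastRows = sc.broadcast(localRows)

    sc.textFile("hdfs:///inputDirectory/batch-0").foreachPartition { partition =>
      val rows = broadcastRows.value      // accessible inside the partition
      partition.foreach { line =>
        // do the lookup against `rows` in plain Scala here, then insert into Mongo
      }
    }
  }
}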



Re: Scheduling Spark process

2015-11-08 Thread Hitoshi Ozawa
I'm not getting your question about scheduling. Did you create a Spark
application and asking how to schedule it to run? Are you going to output
results from the scheduled run in hdfs and join them in the first chain with
the real time result?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Scheduling-Spark-process-tp25287p25317.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does the Standalone cluster and Applications need to be same Spark version?

2015-11-08 Thread Hitoshi Ozawa
I think it depends on the versions. Using something like 0.9.2 and 1.5.1
isn't recommended.
1.5.0 and 1.5.1 is a minor bug release so I think most will work but some
feature may behave differently  so it's better to use the same revision.
Changes between versions/releases are listed in CHANGES.txt file included
with the download.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-the-Standalone-cluster-and-Applications-need-to-be-same-Spark-version-tp25255p25318.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to use --principal and --keytab in SparkSubmit

2015-11-08 Thread Todd
Hi,
I am staring spark thrift server with the following script,
./start-thriftserver.sh --master yarn-client --driver-memory 1G 
--executor-memory 2G --driver-cores 2 --executor-cores 2 --num-executors 4 
--hiveconf hive.server2.thrift.port=10001 --hiveconf 
hive.server2.thrift.bind.host=$(hostname) --principal hdfs/_h...@hadoop.xx  
--keytab /export/keytabs_conf/hdfs.keytab

but an exception is thrown complaining "Unable to obtain password from user":

Exception in thread "main" java.io.IOException: Login failure for 
hdfs/_h...@hadoop.xx from keytab /export/keytabs_conf/hdfs.keytab: 
javax.security.auth.login.LoginException: Unable to obtain password from user

at 
org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:935)
at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:523)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)

I would like to ask how to work with these two parameters to access
Kerberos-secured HDFS. Thanks!


passing RDDs/DataFrames as arguments to functions - what happens?

2015-11-08 Thread Kristina Rogale Plazonic
Hi,

I thought I understood RDDs and DataFrames, but one noob thing is bugging
me (because I'm seeing weird errors involving joins):

*What does Spark do when you pass a big dataframe as an argument to a
function? *

Are these dataframes included in the closure of the function, and is
therefore each big argument dataframe shipped off to each node, instead of
respecting the locality of the distributed data of the dataframe?

Or, are the *references* to these distributed objects included in the
closure, and not the objects themselves? I was assuming the latter is true,
but am not sure anymore.

For example, I wrote a function called myLeftJoin(df1, df2, Seq(columns))
that merges df1 and df2 based on multiple columns present in both, but I
don't want an equijoin, I want a left join and I don't want repeated
columns. Inside the function I build an sql statement to execute. Will this
construct be inefficient? Will it exhaust memory somehow due to passing df1
and df2 in the function?

I had instances where calling a function in spark-shell would produce an
error "unable to acquire  bytes of memory", but subsequently, not
wrapping the code in a function and instead pasting the function body into
the spark-shell would not produce the same error. So, does calling a
function in Spark somehow incur a memory overhead?

Thanks for any clarifications!

Kristina


Re: visualizations using the apache spark

2015-11-08 Thread Hitoshi Ozawa
You can save the result to a storage (e.g. Hive) and have a web application
read data from that. 
I think there's also a "toJSON" method to convert a DataFrame/Dataset to JSON (see the sketch below).
Another option is to use something like Spark Kernel with Spark
sc(https://github.com/ibm-et/spark-kernel/wiki)
Another choice is to use something like Apache Zeppelin
(https://zeppelin.incubator.apache.org/)
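
A minimal sketch of the toJSON route (the query, table and output path are
illustrative):

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: compute some result as a DataFrame and hand it to a web app as JSON.
val sc = new SparkContext(new SparkConf().setAppName("ResultToJson"))
val sqlContext = new SQLContext(sc)

val result = sqlContext.sql("SELECT category, count(*) AS cnt FROM events GROUP BY category")
result.toJSON                                  // RDD[String], one JSON document per row
  .saveAsTextFile("hdfs:///tmp/result-json")   // a web app (or a notebook) can read this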



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/visualizations-using-the-apache-spark-tp25246p25323.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Clustering of Words

2015-11-08 Thread Deep Pradhan
Hi,
I am trying to cluster words of some articles. I used TFIDF and Word2Vec in
Spark to get the vector for each word and I used KMeans to cluster the
words. Now, is there any way to get back the words from the vectors? I want
to know what words are there in each cluster.
I am aware that TFIDF does not have an inverse. Does anyone know how to get
back the words from the clusters?

Thank You
Regards,
Deep
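
For the Word2Vec side there is one possible route (a sketch, assuming the MLlib
Word2VecModel and KMeansModel are still available): take the (word, vector)
pairs that were fed to KMeans and run each vector through KMeansModel.predict
to recover the cluster id per word. As noted in the question, TF-IDF itself is
not invertible.

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.feature.Word2VecModel
import org.apache.spark.mllib.linalg.Vectors

// Sketch only: assumes the models from the earlier steps are at hand.
def wordsPerCluster(
    word2vecModel: Word2VecModel,
    kmeansModel: KMeansModel): Map[Int, Seq[String]] = {
  word2vecModel.getVectors.toSeq            // Seq[(String, Array[Float])]
    .map { case (word, vec) =>
      val point = Vectors.dense(vec.map(_.toDouble))
      (kmeansModel.predict(point), word)    // cluster id for this word
    }
    .groupBy(_._1)
    .mapValues(_.map(_._2))
}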


OLAP query using spark dataframe with cassandra

2015-11-08 Thread fightf...@163.com
Hi, community

We are especially interested in this integration based on some slides from [1]:
the SMACK stack (Spark + Mesos + Akka + Cassandra + Kafka).

It seems like a good implementation of the lambda architecture in the open-source
world, especially for non-Hadoop based cluster environments. As we see it,
the advantages consist of:

1 the feasibility and scalability of the Spark DataFrame API, which is also a
perfect complement to Apache Cassandra's native CQL features.

2 both streaming and batch processing are available on the all-in-one stack, cool.

3 we can achieve both compactness and usability for Spark with Cassandra,
including seamless integration with job scheduling and resource management.

Our only concern is OLAP query performance, which is mainly affected by frequent
aggregation work over daily-growing large tables, for both Spark SQL and
Cassandra. I can see that the use case in [1] relies on FiloDB to achieve columnar
storage and query performance, but we have no further knowledge of it.

The question is: has anyone had such a use case, especially in a production
environment? We would be interested in your architecture for designing this
OLAP engine using Spark + Cassandra. How do you think this scenario compares
with a traditional OLAP cube design, like Apache Kylin or Pentaho Mondrian?

Best Regards,

Sun.


[1]  
http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark



fightf...@163.com


Re: Is SPARK is the right choice for traditional OLAP query processing?

2015-11-08 Thread Hitoshi Ozawa
It depends on how much data needs to be processed. Data Warehouse with
indexes is going to be faster when there is not much data. If you have big
data, Spark Streaming and may be Spark SQL may interest you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-SPARK-is-the-right-choice-for-traditional-OLAP-query-processing-tp23921p25319.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: why prebuild spark 1.5.1 still say Failed to find Spark assembly in

2015-11-08 Thread Hitoshi Ozawa
Are you sure you downloaded the pre-build version? The default is source
build package.
Please check if the file you've downloaded starts "spark-1.5.1-bin-" with a
"bin".



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-prebuild-spark-1-5-1-still-say-Failed-to-find-Spark-assembly-in-tp25234p25321.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: PMML version in MLLib

2015-11-08 Thread Fazlan Nazeem
Hi Vincenzo/Owen,

I have sent a pull request [1] with the necessary changes to add the PMML
version attribute to the root node. I have also linked the issue under the
PMML improvement umbrella [2], as you suggested.

[1] https://github.com/apache/spark/pull/9558
[2]  https://issues.apache.org/jira/browse/SPARK-8545.

On Sun, Nov 8, 2015 at 2:38 PM, Vincenzo Selvaggio 
wrote:

> Hi,
>
> I confirm the models are exported for PMML version 4.2, in fact you can
> see in the generated xml
> <PMML xmlns="http://www.dmg.org/PMML-4_2">
> This is the default version when using
> https://github.com/jpmml/jpmml-model/tree/1.1.X.
>
> I didn't realize the attribute version of the PMML root element was
> required, this can be easily added.
> Please, as Owen suggested, add a PR and link it to
> https://issues.apache.org/jira/browse/SPARK-8545.
>
> Thanks,
> Vincenzo
>
> On Wed, Nov 4, 2015 at 12:14 PM, Fazlan Nazeem  wrote:
>
>> Thanks Owen. Will do it
>>
>> On Wed, Nov 4, 2015 at 5:22 PM, Sean Owen  wrote:
>>
>>> I'm pretty sure that attribute is required. I am not sure what PMML
>>> version the code has been written for but would assume 4.2.1. Feel
>>> free to open a PR to add this version to all the output.
>>>
>>> On Wed, Nov 4, 2015 at 11:42 AM, Fazlan Nazeem  wrote:
>>> > [adding dev]
>>> >
>>> > On Wed, Nov 4, 2015 at 2:27 PM, Fazlan Nazeem 
>>> wrote:
>>> >>
>>> >> I just went through all specifications, and they expect the version
>>> >> attribute. This should be addressed very soon because if we cannot
>>> use the
>>> >> PMML model without the version attribute, there is no use of
>>> generating one
>>> >> without it.
>>> >>
>>> >> On Wed, Nov 4, 2015 at 2:17 PM, Stefano Baghino
>>> >>  wrote:
>>> >>>
>>> >>> I used KNIME, which internally uses the org.dmg.pmml library.
>>> >>>
>>> >>> On Wed, Nov 4, 2015 at 9:45 AM, Fazlan Nazeem 
>>> wrote:
>>> 
>>>  Hi Stefano,
>>> 
>>>  Although the intention for my question wasn't as you expected, what
>>> you
>>>  say makes sense. The standard[1] for PMML 4.1 specifies that "For
>>> PMML 4.1
>>>  the attribute version must have the value 4.1". I'm not sure
>>> whether that
>>>  means that other PMML versions do not need that attribute to be set
>>>  explicitly. I hope someone would answer this.
>>> 
>>>  What was the tool you used to load the PMML?
>>> 
>>>  [1] http://dmg.org/pmml/v4-1/GeneralStructure.html
>>> 
>>>
>>
>>
>>
>> --
>> Thanks & Regards,
>>
>> Fazlan Nazeem
>>
>> *Software Engineer*
>>
>> *WSO2 Inc*
>> Mobile : +94772338839
>> fazl...@wso2.com
>>
>
>


-- 
Thanks & Regards,

Fazlan Nazeem

*Software Engineer*

*WSO2 Inc*
Mobile : +94772338839
fazl...@wso2.com


Re: OLAP query using spark dataframe with cassandra

2015-11-08 Thread Jörn Franke

Is there any distributor supporting these software components in combination? 
If no and your core business is not software then you may want to look for 
something else, because it might not make sense to build up internal know-how 
in all of these areas.

In any case - it depends all highly on your data and queries. You will have to 
do your own experiments.

> On 09 Nov 2015, at 07:02, "fightf...@163.com"  wrote:
> 
> Hi, community
> 
> We are specially interested about this featural integration according to some 
> slides from [1]. The SMACK(Spark+Mesos+Akka+Cassandra+Kafka)
> 
> seems good implementation for lambda architecure in the open-source world, 
> especially non-hadoop based cluster environment. As we can see, 
> 
> the advantages obviously consist of :
> 
> 1 the feasibility and scalability of spark datafram api, which can also make 
> a perfect complement for Apache Cassandra native cql feature.
> 
> 2 both streaming and batch process availability using the ALL-STACK thing, 
> cool.
> 
> 3 we can both achieve compacity and usability for spark with cassandra, 
> including seemlessly integrating with job scheduling and resource management.
> 
> Only one concern goes to the OLAP query performance issue, which mainly 
> caused by frequent aggregation work between daily increased large tables, for 
> 
> both spark sql and cassandra. I can see that the [1] use case facilitates 
> FiloDB to achieve columnar storage and query performance, but we had nothing 
> more 
> 
> knowledge. 
> 
> Question is : Any guy had such use case for now, especially using in your 
> production environment ? Would be interested in your architeture for 
> designing this 
> 
> OLAP engine using spark +  cassandra. What do you think the comparison 
> between the scenario with traditional OLAP cube design? Like Apache Kylin or 
> 
> pentaho mondrian ? 
> 
> Best Regards,
> 
> Sun.
> 
> 
> [1]  
> http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
> 
> fightf...@163.com


Unwanted SysOuts in Spark Parquet

2015-11-08 Thread swetha
Hi,

I see a lot of unwanted SysOuts when I try to save an RDD as a parquet file.
Below are the code and the SysOuts. Any idea as to how to avoid the unwanted
SysOuts?


ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])

AvroParquetOutputFormat.setSchema(job, ActiveSession.SCHEMA$)
activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
classOf[ActiveSession],
  classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)

Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.codec.CodecConfig: Compression
set to false
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.codec.CodecConfig: Compression:
UNCOMPRESSED
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
block size to 134217728
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
page size to 1048576
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
dictionary page size to 1048576
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Dictionary
is on
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Validation
is off
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Writer
version is: PARQUET_1_0
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.InternalParquetRecordWriter:
Flushing mem columnStore to file. allocated memory: 29,159,377
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.codec.CodecConfig: Compression
set to false
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.codec.CodecConfig: Compression:
UNCOMPRESSED
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
block size to 134217728
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
page size to 1048576
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
dictionary page size to 1048576
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Dictionary
is on
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Validation
is off



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unwanted-SysOuts-in-Spark-Parquet-tp25325.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Re: OLAP query using spark dataframe with cassandra

2015-11-08 Thread fightf...@163.com
Hi, 

Thanks for the suggestion. We are actually now evaluating and stress-testing
Spark SQL on Cassandra while trying to define our business models. FWIW, the
solution mentioned here is different from a traditional OLAP cube engine,
right? So we are hesitating over the general direction of our OLAP
architecture.

We are happy to hear more use cases from this community.

Best,
Sun. 



fightf...@163.com
 
From: Jörn Franke
Date: 2015-11-09 14:40
To: fightf...@163.com
CC: user; dev
Subject: Re: OLAP query using spark dataframe with cassandra

Is there any distributor supporting these software components in combination? 
If no and your core business is not software then you may want to look for 
something else, because it might not make sense to build up internal know-how 
in all of these areas.

In any case - it depends all highly on your data and queries. You will have to 
do your own experiments.

On 09 Nov 2015, at 07:02, "fightf...@163.com"  wrote:

Hi, community

We are specially interested about this featural integration according to some 
slides from [1]. The SMACK(Spark+Mesos+Akka+Cassandra+Kafka)

seems good implementation for lambda architecure in the open-source world, 
especially non-hadoop based cluster environment. As we can see, 

the advantages obviously consist of :

1 the feasibility and scalability of spark datafram api, which can also make a 
perfect complement for Apache Cassandra native cql feature.

2 both streaming and batch process availability using the ALL-STACK thing, cool.

3 we can both achieve compacity and usability for spark with cassandra, 
including seemlessly integrating with job scheduling and resource management.

Only one concern goes to the OLAP query performance issue, which mainly caused 
by frequent aggregation work between daily increased large tables, for 

both spark sql and cassandra. I can see that the [1] use case facilitates 
FiloDB to achieve columnar storage and query performance, but we had nothing 
more 

knowledge. 

Question is : Any guy had such use case for now, especially using in your 
production environment ? Would be interested in your architeture for designing 
this 

OLAP engine using spark +  cassandra. What do you think the comparison between 
the scenario with traditional OLAP cube design? Like Apache Kylin or 

pentaho mondrian ? 

Best Regards,

Sun.


[1]  
http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark



fightf...@163.com


PySpark: cannot convert float infinity to integer, when setting batch in add_shuffle_key

2015-11-08 Thread trsell
Hello,

I am running spark 1.5.1 on EMR using Python 3.
I have a pyspark job which is doing some simple joins and reduceByKey
operations. It works fine most of the time, but sometimes I get the
following error:

15/11/09 03:00:53 WARN TaskSetManager: Lost task 2.0 in stage 4.0 (TID
69, ip-172-31-8-142.ap-southeast-1.compute.internal):
org.apache.spark.api.python.PythonException: Traceback (most recent
call last):
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_03/pyspark.zip/pyspark/worker.py",
line 111, in main
process()
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_03/pyspark.zip/pyspark/worker.py",
line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1447027912929_0004/container_1447027912929_0004_01_03/pyspark.zip/pyspark/serializers.py",
line 133, in dump_stream
for obj in iterator:
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1723, in add_shuffle_key
OverflowError: cannot convert float infinity to integer

at 
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at 
scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:914)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:929)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:969)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:118)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


The line in question is:
https://github.com/apache/spark/blob/4f894dd6906311cb57add6757690069a18078783/python/pyspark/rdd.py#L1723

I'm having a hard time seeing how `batch` could ever be set to infinity.
The error is also hard to reproduce consistently.

Help?


Re: Spark Job failing with exit status 15

2015-11-08 Thread Deng Ching-Mallete
Hi Shashi,

It's possible that the log you were seeing is from the second attempt. By
default, I think YARN is configured to re-attempt the job if it fails the
first time. Try checking the application logs from the YARN RM UI, and make
sure that you click the first attempt's log, as that might show the actual
cause of the failure.
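
For this particular run, the quoted log further down already shows the
underlying error: the output directory WordCountOutput already exists, and
saveAsTextFile refuses to overwrite it. A minimal sketch of clearing the path
first (paths and the word-splitting logic are illustrative), or simply pass a
fresh output path on each run:

import java.net.URI

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only, not the original WordCount code.
object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val input  = "hdfs://quickstart.cloudera:8020/user/cloudera/words.txt"
    val output = "hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput"

    // saveAsTextFile will not overwrite, so remove a leftover directory first.
    val fs = FileSystem.get(new URI(output), sc.hadoopConfiguration)
    fs.delete(new Path(output), true)  // recursive; returns false if the path does not exist

    sc.textFile(input)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(output)
  }
}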

HTH,
Deng

On Mon, Nov 9, 2015 at 2:11 AM, Shashi Vishwakarma  wrote:

> Hi
>
> I am using Spark 1.3.0 . Command that I use is below.
>
> /spark-submit --class org.com.td.sparkdemo.spark.WordCount \
> --master yarn-cluster \
> target/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar
>
> Thanks
> Shashi
>
> On Sun, Nov 8, 2015 at 11:33 PM, Ted Yu  wrote:
>
>> Which release of Spark were you using ?
>>
>> Can you post the command you used to run WordCount ?
>>
>> Cheers
>>
>> On Sat, Nov 7, 2015 at 7:59 AM, Shashi Vishwakarma <
>> shashi.vish...@gmail.com> wrote:
>>
>>> I am trying to run simple word count job in spark but I am getting
>>> exception while running job.
>>>
>>> For more detailed output, check application tracking 
>>> page:http://quickstart.cloudera:8088/proxy/application_1446699275562_0006/Then,
>>>  click on links to logs of each attempt.Diagnostics: Exception from 
>>> container-launch.Container id: container_1446699275562_0006_02_01Exit 
>>> code: 15Stack trace: ExitCodeException exitCode=15:
>>> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>>> at org.apache.hadoop.util.Shell.run(Shell.java:455)
>>> at 
>>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>>> at 
>>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>>> at 
>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>>> at 
>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Container exited with a non-zero exit code 15Failing this attempt. Failing 
>>> the application.
>>>  ApplicationMaster host: N/A
>>>  ApplicationMaster RPC port: -1
>>>  queue: root.cloudera
>>>  start time: 1446910483956
>>>  final status: FAILED
>>>  tracking URL: 
>>> http://quickstart.cloudera:8088/cluster/app/application_1446699275562_0006
>>>  user: clouderaException in thread "main" 
>>> org.apache.spark.SparkException: Application finished with failed status
>>> at org.apache.spark.deploy.yarn.Client.run(Client.scala:626)
>>> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:651)
>>> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at 
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at 
>>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
>>> at 
>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>>> at 
>>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>
>>> I checked log from following command
>>>
>>> yarn logs -applicationId application_1446699275562_0006
>>>
>>> Here is log
>>>
>>>  15/11/07 07:35:09 ERROR yarn.ApplicationMaster: User class threw 
>>> exception: Output directory 
>>> hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists
>>> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
>>> hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists
>>> at 
>>> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
>>> at 
>>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1053)
>>> at 
>>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:954)
>>> at 
>>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:863)
>>> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1290)
>>> at org.com.td.sparkdemo.spark.WordCount$.main(WordCount.scala:23)
>>>

Re: sqlCtx.sql('some_hive_table') works in pyspark but not spark-submit

2015-11-08 Thread Deng Ching-Mallete
Hi,

Did you check if HADOOP_CONF_DIR is configured in your YARN application
classpath? By default, the shell runs in local client mode, which is
probably why it resolves the env variable you're setting and is able to
find the Hive metastore from your hive-site.xml.

HTH,
Deng

On Sun, Nov 8, 2015 at 6:12 AM, YaoPau  wrote:

> Within a pyspark shell, both of these work for me:
>
> print hc.sql("SELECT * from raw.location_tbl LIMIT 10").collect()
> print sqlCtx.sql("SELECT * from raw.location_tbl LIMIT 10").collect()
>
> But when I submit both of those in batch mode (hc and sqlCtx both exist), I
> get the following error.  Why is this happening?  I'll note that I'm
> running
> on YARN (CDH) and connecting to the Hive Metastore by setting an
> environment
> variable with export HADOOP_CONF_DIR=/etc/hive/conf/
>
> An error occurred while calling o39.sql.
> : java.lang.RuntimeException: Table Not Found: raw.location_tbl
> at scala.sys.package$.error(package.scala:27)
> at
>
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:111)
> at
>
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:111)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:58)
> at
>
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:111)
> at
>
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:175)
> at
>
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:187)
> at
>
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:182)
> at
>
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187)
> at
>
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187)
> at
>
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
> at
>
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:186)
> at
>
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:207)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
>
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:236)
> at
>
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:192)
> at
>
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:207)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
>
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:236)
> at
>
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:192)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:177)
> at
>
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:182)
> at
>
> 

Re: How to analyze weather data in Spark?

2015-11-08 Thread ayan guha
Hi

Is it possible to elaborate a little more?

In order to consume a fixed width file, the standard process should be

1. Write a map function which takes a line as a string and implements the
file spec to return a tuple of fields.
2. Load the files using sc.textFile (which reads the lines as strings).
3. Pass the lines to the map function and get back an RDD of fields, as in
the sketch below.
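
A minimal sketch of those steps (the character positions below are
illustrative; the real offsets come from the ish-format document):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: hypothetical fixed-width positions, not the real ISH layout.
def parseLine(line: String): (String, String, Int) = (
  line.substring(4, 10).trim,          // station id
  line.substring(15, 23).trim,         // observation date
  line.substring(87, 92).trim.toInt    // air temperature (scaled value)
)

val sc = new SparkContext(new SparkConf().setAppName("WeatherParse"))
val fields = sc.textFile("hdfs:///data/noaa/2015/*").map(parseLine)  // RDD of field tuples
fields.take(5).foreach(println)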

Ayan

On Mon, Nov 9, 2015 at 3:20 PM, Hitoshi Ozawa  wrote:

> There's a document describing the format of files in the parent directory.
> It
> seems like a fixed width file.
> ftp://ftp.ncdc.noaa.gov/pub/data/noaa/ish-format-document.pdf
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-analyze-weather-data-in-Spark-tp25256p25320.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha


Re: Is SPARK is the right choice for traditional OLAP query processing?

2015-11-08 Thread chandan prakash
Apache Drill is also a very good candidate for this.



On Mon, Nov 9, 2015 at 9:33 AM, Hitoshi Ozawa  wrote:

> It depends on how much data needs to be processed. Data Warehouse with
> indexes is going to be faster when there is not much data. If you have big
> data, Spark Streaming and may be Spark SQL may interest you.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-SPARK-is-the-right-choice-for-traditional-OLAP-query-processing-tp23921p25319.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Chandan Prakash


java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-08 Thread fanooos
This is my first Spark Streaming application. The setup is as follows.

3 nodes running a Spark cluster: one master node and two slaves.

The application is a simple Java application streaming from Twitter, with
dependencies managed by Maven.

Here is the code of the application

public class SimpleApp {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("Simple Application")
                .setMaster("spark://rethink-node01:7077");

        JavaStreamingContext sc = new JavaStreamingContext(conf, new Duration(1000));

        ConfigurationBuilder cb = new ConfigurationBuilder();

        cb.setDebugEnabled(true).setOAuthConsumerKey("ConsumerKey")
                .setOAuthConsumerSecret("ConsumerSecret")
                .setOAuthAccessToken("AccessToken")
                .setOAuthAccessTokenSecret("TokenSecret");

        OAuthAuthorization auth = new OAuthAuthorization(cb.build());

        JavaDStream<Status> tweets = TwitterUtils.createStream(sc, auth);

        JavaDStream<String> statuses = tweets.map(new Function<Status, String>() {
            public String call(Status status) throws Exception {
                return status.getText();
            }
        });

        statuses.print();

        sc.start();

        sc.awaitTermination();

    }

}


here is the pom file

<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
        http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>SparkFirstTry</groupId>
    <artifactId>SparkFirstTry</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.5.1</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.5.1</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.twitter4j</groupId>
            <artifactId>twitter4j-stream</artifactId>
            <version>3.0.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-twitter_2.10</artifactId>
            <version>1.0.0</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.test.sparkTest.SimpleApp</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>


The application starts successfully, but no tweets come and this exception
is thrown:

15/11/08 15:55:46 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 78,
192.168.122.39): java.io.IOException: java.lang.ClassNotFoundException:
org.apache.spark.streaming.twitter.TwitterReceiver
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
at
org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.streaming.twitter.TwitterReceiver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at

Wrap an RDD with a ShuffledRDD

2015-11-08 Thread Muhammad Haseeb Javed
I am working on a modified Spark core and have a Broadcast variable which I
deserialize to obtain an RDD along with its set of dependencies, as is done
in ShuffleMapTask, as follows:

val taskBinary: Broadcast[Array[Byte]]
var (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
  ByteBuffer.wrap(taskBinary.value),
  Thread.currentThread.getContextClassLoader)

However, I want to wrap this rdd in a ShuffledRDD because I need to apply a
custom partitioner to it, and I am doing this by:

var wrappedRDD = new ShuffledRDD[_ ,_, _](rdd[_ <: Product2[Any,
Any]], context.getCustomPartitioner())

but it results in an error:

Error:unbound wildcard type rdd = new ShuffledRDD[_ ,_, _ ](rdd[_ <:
Product2[Any, Any]], context.getCustomPartitioner())
..^

The problem is that I don't know how to replace these wildcards with any
inferred type, as it is supposed to be dynamic and I have no idea what the
inferred type of the original rdd would be. Any idea how I could resolve
this?
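
One direction that typically gets past the unbound-wildcard error (a sketch,
not a tested fix for this particular build): give ShuffledRDD concrete Any
type arguments and cast the deserialized RDD once, similar in spirit to how
ShuffleMapTask treats the records as Product2[Any, Any].

import org.apache.spark.Partitioner
import org.apache.spark.rdd.{RDD, ShuffledRDD}

// Sketch only: replaces the unbound wildcards with concrete Any parameters.
// `rdd` is the RDD[_] recovered from the task binary, `partitioner` is the custom one.
def wrapWithShuffle(rdd: RDD[_], partitioner: Partitioner): RDD[(Any, Any)] = {
  val pairRdd = rdd.asInstanceOf[RDD[Product2[Any, Any]]]  // single unchecked cast
  new ShuffledRDD[Any, Any, Any](pairRdd, partitioner)
}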


Re: How to analyze weather data in Spark?

2015-11-08 Thread Hitoshi Ozawa
There's a document describing the format of files in the parent directory. It
seems like a fixed width file.
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/ish-format-document.pdf



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-analyze-weather-data-in-Spark-tp25256p25320.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-08 Thread Sean Owen
You included a very old version of the Twitter jar - 1.0.0. Did you mean 1.5.1?

On Mon, Nov 9, 2015 at 7:36 AM, fanooos  wrote:
> This is my first Spark Stream application. The setup is as following
>
> 3 nodes running a spark cluster. One master node and two slaves.
>
> The application is a simple java application streaming from twitter and
> dependencies managed by maven.
>
> Here is the code of the application
>
> public class SimpleApp {
>
> public static void main(String[] args) {
>
> SparkConf conf = new SparkConf().setAppName("Simple
> Application").setMaster("spark://rethink-node01:7077");
>
> JavaStreamingContext sc = new JavaStreamingContext(conf, new
> Duration(1000));
>
> ConfigurationBuilder cb = new ConfigurationBuilder();
>
> cb.setDebugEnabled(true).setOAuthConsumerKey("ConsumerKey")
> .setOAuthConsumerSecret("ConsumerSecret")
> .setOAuthAccessToken("AccessToken")
> .setOAuthAccessTokenSecret("TokenSecret");
>
> OAuthAuthorization auth = new OAuthAuthorization(cb.build());
>
> JavaDStream tweets = TwitterUtils.createStream(sc, auth);
>
>  JavaDStream statuses = tweets.map(new Function String>() {
>  public String call(Status status) throws Exception {
> return status.getText();
> }
> });
>
>  statuses.print();;
>
>  sc.start();
>
>  sc.awaitTermination();
>
> }
>
> }
>
>
> here is the pom file
>
> http://maven.apache.org/POM/4.0.0;
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> http://maven.apache.org/xsd/maven-4.0.0.xsd;>
> 4.0.0
> SparkFirstTry
> SparkFirstTry
> 0.0.1-SNAPSHOT
>
> 
> 
> org.apache.spark
> spark-core_2.10
> 1.5.1
> provided
> 
>
> 
> org.apache.spark
> spark-streaming_2.10
> 1.5.1
> provided
> 
>
> 
> org.twitter4j
> twitter4j-stream
> 3.0.3
> 
> 
> org.apache.spark
> spark-streaming-twitter_2.10
> 1.0.0
> 
>
> 
>
> 
> src
> 
> 
> maven-compiler-plugin
> 3.3
> 
> 1.8
> 1.8
> 
> 
> 
> maven-assembly-plugin
> 
> 
> 
>
> com.test.sparkTest.SimpleApp
> 
> 
> 
> jar-with-dependencies
> 
> 
> 
>
> 
> 
> 
>
>
> The application starts successfully but no tweets comes and this exception
> is thrown
>
> 15/11/08 15:55:46 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 78,
> 192.168.122.39): java.io.IOException: java.lang.ClassNotFoundException:
> org.apache.spark.streaming.twitter.TwitterReceiver
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
> at
> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
> at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException:
> 

parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-08 Thread swetha
Hi,

I see unwanted Warning when I try to save a Parquet file in hdfs in Spark.
Please find below the code and the Warning message. Any idea as to how to
avoid the unwanted Warning message?

activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
classOf[ActiveSession],
  classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)

Nov 8, 2015 11:35:39 PM WARNING: parquet.hadoop.ParquetOutputCommitter:
could not write summary file for active_sessions_current
parquet.io.ParquetEncodingException:
maprfs:/user/testId/active_sessions_current/part-r-00142.parquet invalid:
all the files must be contained in the root active_sessions_current
at
parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
at
parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
at
parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1056)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/parquet-io-ParquetEncodingException-Warning-when-trying-to-save-parquet-file-in-Spark-tp25326.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org