Re: Using partitioning to speed up queries in Shark

2014-11-07 Thread Mayur Rustagi
- dev list & + user list
Shark is not officially supported anymore, so you are better off moving to
Spark SQL.
Shark doesn't support Hive partitioning logic anyway; it has its own version of
partitioning on in-memory blocks, which is independent of whether you
partition your data in Hive or not.
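For reference, a minimal sketch of the Spark SQL route (Spark 1.1 HiveContext,
assuming sc from spark-shell), reusing the table and column names from the
query below:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("CACHE TABLE part_customers_cached")
val counts = hiveContext.sql(
  "SELECT count(*) FROM part_customers_cached " +
  "WHERE createday >= '2014-08-01' AND createday <= '2014-12-06'")
counts.collect().foreach(println)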



Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 


On Fri, Nov 7, 2014 at 3:31 AM, Gordon Benjamin  wrote:

> Hi All,
>
> I'm using Spark/Shark as the foundation for some reporting that I'm doing
> and have a customers table with approximately 3 million rows that I've
> cached in memory.
>
> I've also created a partitioned table that I've also cached in memory on a
> per day basis
>
> FROM
> customers_cached
> INSERT OVERWRITE TABLE
> part_customers_cached
> PARTITION(createday)
> SELECT id,email,dt_cr, to_date(dt_cr) as createday where
> dt_cr>unix_timestamp('2013-01-01 00:00:00') and
> dt_cr set exec.dynamic.partition=true;
>
> set exec.dynamic.partition.mode=nonstrict;
>
> however when I run the following basic tests I get this type of performance
>
> [localhost:1] shark> select count(*) from part_customers_cached where
>  createday >= '2014-08-01' and createday <= '2014-12-06';
> 37204
> Time taken (including network latency): 3.131 seconds
>
> [localhost:1] shark>  SELECT count(*) from customers_cached where
> dt_cr>unix_timestamp('2013-08-01 00:00:00') and
> dt_cr 37204
> Time taken (including network latency): 1.538 seconds
>
> I'm running this on a cluster with one master and two slaves and was hoping
> that the partitioned table would be noticeably faster but it looks as
> though the partitioning has slowed things down... Is this the case, or is
> there some additional configuration that I need to do to speed things up?
>
> Best Wishes,
>
> Gordon
>


Re: Viewing web UI after fact

2014-11-07 Thread Arun Ahuja
We are running our applications through YARN and are only sometimes seeing
them in the History Server.  Most do not seem to have the
APPLICATION_COMPLETE file.  Specifically, any job that ends because of "yarn
application -kill" does not show up.  For the other ones, what would be a reason
for them not to appear in the Spark UI?  Is there any update on this?
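For jobs that end normally, the APPLICATION_COMPLETE marker only appears once
the application calls sc.stop(), so a guard like the minimal sketch below (app
name and job body are placeholders) at least rules that cause out:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("my-app"))
try {
  // ... job logic goes here ...
} finally {
  sc.stop()  // closes the event log so the APPLICATION_COMPLETE marker is written
}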

Thanks,
Arun

On Mon, Sep 15, 2014 at 4:10 AM, Grzegorz Białek <
grzegorz.bia...@codilime.com> wrote:

> Hi Andrew,
>
> sorry for late response. Thank you very much for solving my problem. There
> was no APPLICATION_COMPLETE file in log directory due to not calling
> sc.stop() at the end of program. With stopping spark context everything
> works correctly, so thank you again.
>
> Best regards,
> Grzegorz
>
>
> On Fri, Sep 5, 2014 at 8:06 PM, Andrew Or  wrote:
>
>> Hi Grzegorz,
>>
>> Can you verify that there are "APPLICATION_COMPLETE" files in the event
>> log directories? E.g. Does
>> file:/tmp/spark-events/app-name-1234567890/APPLICATION_COMPLETE exist? If
>> not, it could be that your application didn't call sc.stop(), so the
>> "ApplicationEnd" event is not actually logged. The HistoryServer looks for
>> this special file to identify applications to display. You could also try
>> manually adding the "APPLICATION_COMPLETE" file to this directory; the
>> HistoryServer should pick this up and display the application, though the
>> information displayed will be incomplete because the log did not capture
>> all the events (sc.stop() does a final close() on the file written).
>>
>> Andrew
>>
>>
>> 2014-09-05 1:50 GMT-07:00 Grzegorz Białek :
>>
>> Hi Andrew,
>>>
>>> thank you very much for your answer. Unfortunately it still doesn't
>>> work. I'm using Spark 1.0.0, and I start history server running
>>> sbin/start-history-server.sh, although I also set
>>>  SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory in
>>> conf/spark-env.sh. I also tried a different dir than /tmp/spark-events, which
>>> has all possible permissions enabled. Also adding file: (and file://)
>>> didn't help - history server still shows:
>>> History Server
>>> Event Log Location: file:/tmp/spark-events/
>>> No Completed Applications Found.
>>>
>>> Best regards,
>>> Grzegorz
>>>
>>>
>>> On Thu, Sep 4, 2014 at 8:20 PM, Andrew Or  wrote:
>>>
 Hi Grzegorz,

 Sorry for the late response. Unfortunately, if the Master UI doesn't
 know about your applications (they are "completed" with respect to a
 different Master), then it can't regenerate the UIs even if the logs exist.
 You will have to use the history server for that.

 How did you start the history server? If you are using Spark <=1.0, you
 can pass the directory as an argument to the sbin/start-history-server.sh
 script. Otherwise, you may need to set the following in your
 conf/spark-env.sh to specify the log directory:

 export
 SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=/tmp/spark-events

 It could also be a permissions thing. Make sure your logs in
 /tmp/spark-events are accessible by the JVM that runs the history server.
 Also, there's a chance that "/tmp/spark-events" is interpreted as an HDFS
 path depending on which Spark version you're running. To resolve any
 ambiguity, you may set the log path to "file:/tmp/spark-events" instead.
 But first verify whether they actually exist.

 Let me know if you get it working,
 -Andrew



 2014-08-19 8:23 GMT-07:00 Grzegorz Białek:

 Hi,
> Is there any way to view the history of application statistics in the master UI
> after restarting the master server? I have all logs in /tmp/spark-events/ but
> when I start the history server on this directory it says "No Completed
> Applications Found". Maybe I could copy these logs to the dir used by the master
> server but I couldn't find any. Or maybe I'm doing something wrong
> launching the history server.
> Do you have any idea how to solve it?
>
> Thanks,
> Grzegorz
>
>
> On Thu, Aug 14, 2014 at 10:53 AM, Grzegorz Białek <
> grzegorz.bia...@codilime.com> wrote:
>
>> Hi,
>>
>> Thank you both for your answers. Browsing using the Master UI works fine.
>> Unfortunately the History Server shows "No Completed Applications Found" even
>> if logs exist under the given directory, but using the Master UI is enough for
>> me.
>>
>> Best regards,
>> Grzegorz
>>
>>
>>
>> On Wed, Aug 13, 2014 at 8:09 PM, Andrew Or 
>> wrote:
>>
>>> The Spark UI isn't available through the same address; otherwise new
>>> applications won't be able to bind to it. Once the old application
>>> finishes, the standalone Master renders the after-the-fact application 
>>> UI
>>> and exposes it under a different URL. To see this, go to the Master UI
>>> (:8080) and click on your application in the "Completed
>>> Applications" table.
>>>
>>>
>>> 2

Re: How to add elements into map?

2014-11-07 Thread lalit1303
It doesn't work that way.
Following is the correct way:

val table = sc.textFile(args(1))
val histMap = table.map(x => (x.split('|')(0).toInt, 1))
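Note that histMap above is still an RDD of (Int, Int) pairs, not a local map.
If what you want is a plain Map on the driver, a minimal sketch reusing the same
table RDD (and assuming the first '|'-delimited field parses as an Int):

import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark < 1.3)

val localMap = table
  .map(x => (x.split('|')(0).toInt, 1))
  .collectAsMap()        // scala.collection.Map[Int, Int] on the driver

// or, to count occurrences per key instead of flagging each with 1:
val counts = table
  .map(x => x.split('|')(0).toInt)
  .countByValue()        // Map[Int, Long] on the driver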




-
Lalit Yadav
la...@sigmoidanalytics.com
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-elements-into-map-tp18395p18399.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Fwd: Why is Spark not using all cores on a single machine?

2014-11-07 Thread Ganelin, Ilya
To set the number of Spark cores used you must set two parameters in the actual
spark-submit script. You must set num-executors (the number of executors to launch)
and executor-cores (the number of cores per executor). Please see the Spark
configuration and tuning pages for more details.
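For the local[N] case in the original message below, it is also worth checking
how many partitions the input RDD has: a stage can run at most one task per
partition, so a single-partition RDD will only ever use one core no matter what
the master setting is. A minimal sketch (the file name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[8]").setAppName("abc"))

val data = sc.textFile("input.txt", 8)   // ask for at least 8 partitions up front
println(data.partitions.length)

val repartitioned = data.repartition(8)  // or redistribute an existing RDD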


-Original Message-
From: ll [duy.huynh@gmail.com]
Sent: Saturday, November 08, 2014 12:05 AM Eastern Standard Time
To: u...@spark.incubator.apache.org
Subject: Re: Fwd: Why is Spark not using all cores on a single machine?


hi.  i did use local[8] as below, but it still ran on only 1 core.

val sc = new SparkContext(new
SparkConf().setMaster("local[8]").setAppName("abc"))

any advice is much appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Why-is-Spark-not-using-all-cores-on-a-single-machine-tp1638p18397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: Fwd: Why is Spark not using all cores on a single machine?

2014-11-07 Thread ll
hi.  i did use local[8] as below, but it still ran on only 1 core.

val sc = new SparkContext(new
SparkConf().setMaster("local[8]").setAppName("abc"))

any advice is much appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Why-is-Spark-not-using-all-cores-on-a-single-machine-tp1638p18397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Fwd: How to add elements into map?

2014-11-07 Thread Tim Chou
Here is the code I run in spark-shell:

val table = sc.textFile(args(1))
val histMap = collection.mutable.Map[Int,Int]()
for (x <- table) {
  val tuple = x.split('|')
  histMap.put(tuple(0).toInt, 1)
}

Why is histMap still null?
Is there something wrong with my code?

Thanks,
Tim


How to add elements into map?

2014-11-07 Thread Tim Chou
Here is the code I run in spark-shell:

val table = sc.textFile(args(1))
val histMap = collection.mutable.Map[Int,Int]()
for (x <- table) {
  val tuple = x.split('|')
  histMap.put(tuple(0).toInt, 1)
}

Why is histMap still null?
Is there something wrong with my code?

Thanks,
Fang


Spark 1.1.0 Can not read snappy compressed sequence file

2014-11-07 Thread Stéphane Verlet
I first saw this using SparkSQL but the result is the same with plain
Spark.

14/11/07 19:46:36 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native
Method)
at
org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)

Full stack below 

I tried many different things without luck:
   * extracted the libsnappyjava.so from the Spark assembly and put it on the
     library path
   * added -Djava.library.path=... to SPARK_MASTER_OPTS and SPARK_WORKER_OPTS
   * added the library path to SPARK_LIBRARY_PATH
   * added the Hadoop native library path to SPARK_LIBRARY_PATH
   * rebuilt Spark with different versions (previous and next) of Snappy
     (as seen when Googling)
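For reference, this is the kind of configuration I would have expected to work,
pointing the driver and executors at the Hadoop native libraries (the path is an
assumption for a CDH install; substitute wherever libhadoop.so and libsnappy.so
actually live):

import org.apache.spark.{SparkConf, SparkContext}

val nativeLibs = "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native"
val conf = new SparkConf()
  .setAppName("snappy-sequencefile-test")
  .set("spark.driver.extraLibraryPath", nativeLibs)
  .set("spark.executor.extraLibraryPath", nativeLibs)
val sc = new SparkContext(conf)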


Env:
   CentOS 6.4
   Hadoop 2.3 (CDH 5.1)
   Running in standalone/local mode


Any help would be appreciated

Thank you

Stephane


scala> import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.BytesWritable

scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text

scala> import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.NullWritable

scala> var seq =
sc.sequenceFile[NullWritable,Text]("/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0").map(_._2.toString())
14/11/07 19:46:19 INFO MemoryStore: ensureFreeSpace(157973) called with
curMem=0, maxMem=278302556
14/11/07 19:46:19 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 154.3 KB, free 265.3 MB)
seq: org.apache.spark.rdd.RDD[String] = MappedRDD[2] at map at <console>:15

scala> seq.collect().foreach(println)
14/11/07 19:46:35 INFO FileInputFormat: Total input paths to process : 1
14/11/07 19:46:35 INFO SparkContext: Starting job: collect at <console>:18
14/11/07 19:46:35 INFO DAGScheduler: Got job 0 (collect at <console>:18)
with 2 output partitions (allowLocal=false)
14/11/07 19:46:35 INFO DAGScheduler: Final stage: Stage 0(collect at
<console>:18)
14/11/07 19:46:35 INFO DAGScheduler: Parents of final stage: List()
14/11/07 19:46:35 INFO DAGScheduler: Missing parents: List()
14/11/07 19:46:35 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[2] at
map at <console>:15), which has no missing parents
14/11/07 19:46:35 INFO MemoryStore: ensureFreeSpace(2928) called with
curMem=157973, maxMem=278302556
14/11/07 19:46:35 INFO MemoryStore: Block broadcast_1 stored as values in
memory (estimated size 2.9 KB, free 265.3 MB)
14/11/07 19:46:36 INFO DAGScheduler: Submitting 2 missing tasks from Stage
0 (MappedRDD[2] at map at <console>:15)
14/11/07 19:46:36 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/11/07 19:46:36 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, localhost, PROCESS_LOCAL, 1243 bytes)
14/11/07 19:46:36 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
1, localhost, PROCESS_LOCAL, 1243 bytes)
14/11/07 19:46:36 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/11/07 19:46:36 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/11/07 19:46:36 INFO HadoopRDD: Input split:
file:/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0:6504064+6504065
14/11/07 19:46:36 INFO HadoopRDD: Input split:
file:/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0:0+6504064
14/11/07 19:46:36 INFO deprecation: mapred.tip.id is deprecated. Instead,
use mapreduce.task.id
14/11/07 19:46:36 INFO deprecation: mapred.task.is.map is deprecated.
Instead, use mapreduce.task.ismap
14/11/07 19:46:36 INFO deprecation: mapred.task.partition is deprecated.
Instead, use mapreduce.task.partition
14/11/07 19:46:36 INFO deprecation: mapred.job.id is deprecated. Instead,
use mapreduce.job.id
14/11/07 19:46:36 INFO deprecation: mapred.task.id is deprecated. Instead,
use mapreduce.task.attempt.id
14/11/07 19:46:36 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native
Method)
at
org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
at
org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:190)
at
org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:176)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
at
org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1759)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
at
org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
at
org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)
at org.apache.spark.rdd.HadoopRDD.c

Re: PySpark issue with sortByKey: "IndexError: list index out of range"

2014-11-07 Thread Davies Liu
Could you tell us how large the data set is? It will help us debug this issue.

On Thu, Nov 6, 2014 at 10:39 AM, skane  wrote:
> I don't have any insight into this bug, but on Spark version 1.0.0 I ran into
> the same bug running the 'sort.py' example. On a smaller data set, it worked
> fine. On a larger data set I got this error:
>
> Traceback (most recent call last):
>   File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in
> 
> .sortByKey(lambda x: x)
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey
> bounds.append(samples[index])
> IndexError: list index out of range
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18288.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkPi endlessly in "yarnAppState: ACCEPTED"

2014-11-07 Thread jayunit100
Sounds like no free YARN workers, i.e. try running:

hadoop-mapreduce-examples-2.1.0-beta.jar pi 1 1

We have some smoke tests which you might find particularly useful for YARN
clusters as well in https://github.com/apache/bigtop, underneath
bigtop-tests/smoke-tests, which are generally good to
run on any basic Hadoop cluster when you first set it up.

Often if a Yarn job hangs in accepted state, it is waiting for resources to 
free up to start the tasks...

On Nov 7, 2014, at 7:40 PM, YaoPau  wrote:

> appStartTime



SparkPi endlessly in "yarnAppState: ACCEPTED"

2014-11-07 Thread YaoPau
I'm using Cloudera 5.1.3, and I'm repeatedly getting the following output
after submitting the SparkPi example in yarn cluster mode
(http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_spark_apps.html)
using:

spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster
--master yarn
$SPARK_HOME/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.3.jar 10

Output (repeated):

14/11/07 19:33:05 INFO Client: Application report from ASM: 
 application identifier: application_1415303569855_1100
 appId: 1100
 clientToAMToken: null
 appDiagnostics: 
 appMasterHost: N/A
 appQueue: root.yp
 appMasterRpcPort: -1
 appStartTime: 1415406486231
 yarnAppState: ACCEPTED
 distributedFinalState: UNDEFINED

I'll note that spark-submit is working correctly when running with "master
local" on the edge node.  

Any ideas how to solve this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkPi-endlessly-in-yarnAppState-ACCEPTED-tp18391.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MatrixFactorizationModel serialization

2014-11-07 Thread Sean Owen
Serializable like a Java object? No, it's backed by RDDs. A factored matrix
model is huge, unlike most models, and is not a local object. You can
of course persist the RDDs to storage manually and read them back.
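A minimal sketch of doing that by hand (note that the MatrixFactorizationModel
constructor is package-private in some releases, so rebuilding the model this
way may require putting the code in the org.apache.spark.mllib.recommendation
package, or simply keeping the two feature RDDs around instead):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def saveFactors(model: MatrixFactorizationModel, path: String): Unit = {
  model.userFeatures.saveAsObjectFile(path + "/userFeatures")
  model.productFeatures.saveAsObjectFile(path + "/productFeatures")
}

def loadFactors(sc: SparkContext, rank: Int, path: String): MatrixFactorizationModel = {
  val userFeatures = sc.objectFile[(Int, Array[Double])](path + "/userFeatures")
  val productFeatures = sc.objectFile[(Int, Array[Double])](path + "/productFeatures")
  new MatrixFactorizationModel(rank, userFeatures, productFeatures)
}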

On Fri, Nov 7, 2014 at 11:33 PM, Dariusz Kobylarz
 wrote:
> I am trying to persist MatrixFactorizationModel (Collaborative Filtering
> example) and use it in another script to evaluate/apply it.
> This is the exception I get when I try to use a deserialized model instance:
>
> Exception in thread "main" java.lang.NullPointerException
> at
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1.apply$mcVI$sp(CoGroupedRDD.scala:103)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at
> org.apache.spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:101)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> at
> org.apache.spark.rdd.MappedValuesRDD.getPartitions(MappedValuesRDD.scala:26)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> at
> org.apache.spark.rdd.FlatMappedValuesRDD.getPartitions(FlatMappedValuesRDD.scala:26)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> at
> org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
> at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
> at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
> at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
> at java.util.TimSort.sort(TimSort.java:189)
> at java.util.TimSort.sort(TimSort.java:173)
> at java.util.Arrays.sort(Arrays.java:659)
> at scala.collection.SeqLike$class.sorted(SeqLike.scala:615)
> at scala.collection.AbstractSeq.sorted(Seq.scala:40)
> at scala.collection.SeqLike$class.sortBy(SeqLike.scala:594)
> at scala.collection.AbstractSeq.sortBy(Seq.scala:40)
> at
> org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
> at
> org.apache.spark.rdd.PairRDDFunctions.join(PairRDDFunctions.scala:536)
> at
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:57)
> ...
>
> Is this model serializable at all, I noticed it has two RDDs inside (user &
> product features)?
>
> Thanks,
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



MatrixFactorizationModel serialization

2014-11-07 Thread Dariusz Kobylarz
I am trying to persist MatrixFactorizationModel (Collaborative Filtering 
example) and use it in another script to evaluate/apply it.

This is the exception I get when I try to use a deserialized model instance:

Exception in thread "main" java.lang.NullPointerException
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1.apply$mcVI$sp(CoGroupedRDD.scala:103)

at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:101)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MappedValuesRDD.getPartitions(MappedValuesRDD.scala:26)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.FlatMappedValuesRDD.getPartitions(FlatMappedValuesRDD.scala:26)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
at java.util.TimSort.sort(TimSort.java:189)
at java.util.TimSort.sort(TimSort.java:173)
at java.util.Arrays.sort(Arrays.java:659)
at scala.collection.SeqLike$class.sorted(SeqLike.scala:615)
at scala.collection.AbstractSeq.sorted(Seq.scala:40)
at scala.collection.SeqLike$class.sortBy(SeqLike.scala:594)
at scala.collection.AbstractSeq.sortBy(Seq.scala:40)
at 
org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
at 
org.apache.spark.rdd.PairRDDFunctions.join(PairRDDFunctions.scala:536)
at 
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:57)

...

Is this model serializable at all, I noticed it has two RDDs inside 
(user & product features)?


Thanks,



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Any patterns for multiplexing the streaming data

2014-11-07 Thread Anand Iyer
Hi TD,

This is a common pattern that is emerging today. Kafka --> SS --> Kafka.

Spark Streaming comes with a built-in consumer to read from Kafka. It would
be great to have an easy way for users to write back to Kafka without
having to code a custom producer using the Kafka Producer APIs.
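In the meantime, the usual workaround is a producer created inside
foreachPartition. A hedged sketch against the Kafka 0.8 Scala producer API
(broker list and topic are placeholders):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.streaming.dstream.DStream

def writeToKafka(stream: DStream[String], brokers: String, topic: String): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // one producer per partition per batch; pooling would be the next refinement
      val props = new Properties()
      props.put("metadata.broker.list", brokers)
      props.put("serializer.class", "kafka.serializer.StringEncoder")
      val producer = new Producer[String, String](new ProducerConfig(props))
      partition.foreach(record =>
        producer.send(new KeyedMessage[String, String](topic, record)))
      producer.close()
    }
  }
}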

Are there any plans to commit the code in the above github repo? If so, do
you have a rough estimate of when.

Thanks,

Anand

On Fri, Nov 7, 2014 at 1:25 PM, Tathagata Das 
wrote:

> I am not aware of any obvious existing pattern that does exactly this.
> Generally, this sort of computation (subsetting, denormalization) involves such
> generic-sounding terms but actually has very specific requirements, so it is
> hard to point to a design pattern without more requirement info.
>
> If you want to feed back to kafka, you can take a look at this pull request
>
> https://github.com/apache/spark/pull/2994
>
> On Thu, Nov 6, 2014 at 4:15 PM, bdev  wrote:
>
>> We are looking at consuming the kafka stream using Spark Streaming and
>> transform into various subsets like applying some transformation or
>> de-normalizing some fields, etc. and feed it back into Kafka as a
>> different
>> topic for downstream consumers.
>>
>> Wanted to know if there are any existing patterns for achieving this.
>>
>> Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Any-patterns-for-multiplexing-the-streaming-data-tp18303.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


spark context not defined

2014-11-07 Thread Pagliari, Roberto
I'm running the latest version of spark with Hadoop 1.x and scala 2.9.3 and 
hive 0.9.0.

When using python 2.7
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

I'm getting 'sc not defined'

On the other hand, I can see 'sc' from pyspark CLI.

Is there a way to fix it?


Re: error when importing HiveContext

2014-11-07 Thread Davies Liu
bin/pyspark will set up the PYTHONPATH for py4j for you; otherwise you need to
set it up yourself.

export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip

On Fri, Nov 7, 2014 at 8:15 AM, Pagliari, Roberto
 wrote:
> I’m getting this error when importing hive context
>
>
>
 from pyspark.sql import HiveContext
>
> Traceback (most recent call last):
>
>   File "<stdin>", line 1, in <module>
>
>   File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module>
>
> from pyspark.context import SparkContext
>
>   File "/path/spark-1.1.0/python/pyspark/context.py", line 30, in <module>
>
> from pyspark.java_gateway import launch_gateway
>
>   File "/path/spark-1.1.0/python/pyspark/java_gateway.py", line 26, in
> <module>
>
> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
>
> ImportError: No module named py4j.java_gateway
>
>
>
> I cannot find py4j on my system. Where is it?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Integrating Spark with other applications

2014-11-07 Thread Thomas Risberg
Hi,

I'm a committer on that spring-hadoop project and I'm also interested in
integrating Spark with other Java applications. I would love to see some
guidance from the Spark community for the best way to accomplish this. We
have plans to add features to work with Spark Apps in similar ways we now
support Hive and Pig jobs in the spring-hadoop project. In fact, I added a
spring-hadoop-spark sub-project earlier, but there is no real code there
yet. Hoping to get this added soon, so some helpful pointers would be great.

-Thomas

[1]
https://github.com/spring-projects/spring-hadoop/tree/master/spring-hadoop-spark/src/main/java/org/springframework/data/hadoop/spark

On Fri, Nov 7, 2014 at 5:42 PM, gtinside  wrote:

> Hi ,
>
> I have been working on Spark SQL and want to expose this functionality to
> other applications. The idea is to let other applications send SQL to be
> executed on the Spark cluster and get the result back. I looked at spark job
> server (https://github.com/ooyala/spark-jobserver) but it provides a
> RESTful
> interface. I am looking for something similar to
> spring-hadoop (http://projects.spring.io/spring-hadoop/) to do a
> spark-submit
> programmatically.
>
> Regards,
> Gaurav
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Integrating-Spark-with-other-applications-tp18383.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Integrating Spark with other applications

2014-11-07 Thread gtinside
Hi ,

I have been working on Spark SQL and want to expose this functionality to
other applications. The idea is to let other applications send SQL to be
executed on the Spark cluster and get the result back. I looked at spark job
server (https://github.com/ooyala/spark-jobserver) but it provides a RESTful
interface. I am looking for something similar to
spring-hadoop (http://projects.spring.io/spring-hadoop/) to do a spark-submit
programmatically.
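For context, what I can do today is shell out to the spark-submit script from
JVM code; a minimal sketch (class name, master URL, jar path, and the SQL
argument are placeholders):

import scala.sys.process._

val cmd = Seq(
  "/usr/local/lib/spark/bin/spark-submit",
  "--class", "com.example.SqlRunner",     // hypothetical driver class
  "--master", "spark://master:7077",
  "/path/to/my-app.jar",
  "SELECT count(*) FROM customers")       // forwarded to the driver as an argument
val exitCode = cmd.!                      // blocks until spark-submit exits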

Regards,
Gaurav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Integrating-Spark-with-other-applications-tp18383.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark streaming: stderr does not roll

2014-11-07 Thread Nguyen, Duc
We are running spark streaming jobs (version 1.1.0). After a sufficient
amount of time, the stderr file grows until the disk is full at 100% and
crashes the cluster. I've read this

https://github.com/apache/spark/pull/895

and also read this

http://spark.apache.org/docs/latest/configuration.html#spark-streaming


So I've tried testing with this in an attempt to get the stderr log file to
roll.

sparkConf.set("spark.executor.logs.rolling.strategy", "size")
.set("spark.executor.logs.rolling.size.maxBytes", "1024")
.set("spark.executor.logs.rolling.maxRetainedFiles", "3")


Yet it does not roll and continues to grow. Am I missing something obvious?


thanks,
Duc


Re: Parallelize on spark context

2014-11-07 Thread _soumya_
Naveen, 
 Don't be worried - you're not the only one to be bitten by this. A little
inspection of the Javadoc told me you have this other option: 

JavaRDD distData = sc.parallelize(data, 100);

-- Now the RDD is split into 100 partitions.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parallelize-on-spark-context-tp18327p18381.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: deploying a model built in mllib

2014-11-07 Thread Donald Szeto
Hi Chirag,

Could you please provide more information on your Java server environment?

Regards,
Donald

On Fri, Nov 7, 2014 at 9:57 AM, chirag lakhani 
wrote:

> Thanks for letting me know about this, it looks pretty interesting.  From
> reading the documentation it seems that the server must be built on a Spark
> cluster, is that correct?  Is it possible to deploy it on a Java
> server?  That is how we are currently running our web app.
>
>
>
> On Tue, Nov 4, 2014 at 7:57 PM, Simon Chan  wrote:
>
>> The latest version of PredictionIO, which is now under Apache 2 license,
>> supports the deployment of MLlib models on production.
>>
>> The "engine" you build will including a few components, such as:
>> - Data - includes Data Source and Data Preparator
>> - Algorithm(s)
>> - Serving
>> I believe that you can do the feature vector creation inside the Data
>> Preparator component.
>>
>> Currently, the package comes with two templates: 1)  Collaborative
>> Filtering Engine Template - with MLlib ALS; 2) Classification Engine
>> Template - with MLlib Naive Bayes. The latter one may be useful to you. And
>> you can customize the Algorithm component, too.
>>
>> I have just created a doc: http://docs.prediction.io/0.8.1/templates/
>> Love to hear your feedback!
>>
>> Regards,
>> Simon
>>
>>
>>
>> On Mon, Oct 27, 2014 at 11:03 AM, chirag lakhani <
>> chirag.lakh...@gmail.com> wrote:
>>
>>> Would pipelining include model export?  I didn't see that in the
>>> documentation.
>>>
>>> Are there ways that this is being done currently?
>>>
>>>
>>>
>>> On Mon, Oct 27, 2014 at 12:39 PM, Xiangrui Meng 
>>> wrote:
>>>
 We are working on the pipeline features, which would make this
 procedure much easier in MLlib. This is still a WIP and the main JIRA
 is at:

 https://issues.apache.org/jira/browse/SPARK-1856

 Best,
 Xiangrui

 On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani
  wrote:
 > Hello,
 >
 > I have been prototyping a text classification model that my company
 would
 > like to eventually put into production.  Our technology stack is
 currently
 > Java based but we would like to be able to build our models in
 Spark/MLlib
 > and then export something like a PMML file which can be used for model
 > scoring in real-time.
 >
 > I have been using scikit learn where I am able to take the training
 data
 > convert the text data into a sparse data format and then take the
 other
 > features and use the dictionary vectorizer to do one-hot encoding for
 the
 > other categorical variables.  All of those things seem to be possible
 in
 > mllib but I am still puzzled about how that can be packaged in such a
 way
 > that the incoming data can be first made into feature vectors and then
 > evaluated as well.
 >
 > Are there any best practices for this type of thing in Spark?  I hope
 this
 > is clear but if there are any confusions then please let me know.
 >
 > Thanks,
 >
 > Chirag

>>>
>>>
>>
>


-- 
Donald Szeto
PredictionIO


Re: jsonRdd and MapType

2014-11-07 Thread Yin Huai
Hello Brian,

Right now, MapType is not supported in the StructType provided to
jsonRDD/jsonFile. We will add the support. I have created
https://issues.apache.org/jira/browse/SPARK-4302 to track this issue.

Thanks,

Yin

On Fri, Nov 7, 2014 at 3:41 PM, boclair  wrote:

> I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)).
> I'd like some of the json fields to be in a MapType rather than a sub
> StructType, as the keys will be very sparse.
>
> For example:
> > val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> > import sqlContext.createSchemaRDD
> > val jsonRdd = sc.parallelize(Seq("""{"key": "1234", "attributes":
> > {"gender": "m"}}""",
>"""{"key": "4321",
> "attributes": {"location": "nyc"}}"""))
> > val schemaRdd = sqlContext.jsonRDD(jsonRdd)
> > schemaRdd.printSchema
> root
>  |-- attributes: struct (nullable = true)
>  ||-- gender: string (nullable = true)
>  ||-- location: string (nullable = true)
>  |-- key: string (nullable = true)
> > schemaRdd.collect
> res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234],
> [[null,nyc],4321])
>
>
> However this isn't what I want.  So I created my own StructType to pass to
> the jsonRDD call:
>
> > import org.apache.spark.sql._
> > val st = StructType(Seq(StructField("key", StringType, false),
>StructField("attributes",
> MapType(StringType, StringType, false
> > val jsonRddSt = sc.parallelize(Seq("""{"key": "1234", "attributes":
> > {"gender": "m"}}""",
>   """{"key": "4321",
> "attributes": {"location": "nyc"}}"""))
> > val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st)
> > schemaRddSt.printSchema
> root
>  |-- key: string (nullable = false)
>  |-- attributes: map (nullable = true)
>  ||-- key: string
>  ||-- value: string (valueContainsNull = false)
> > schemaRddSt.collect
> ***  Failure  ***
> scala.MatchError: MapType(StringType,StringType,false) (of class
> org.apache.spark.sql.catalyst.types.MapType)
> at
> org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
> ...
>
> The schema of the schemaRDD is correct.  But it seems that the json cannot
> be coerced to a MapType.  I can see at the line in the stack trace that
> there is no case statement for MapType.  Is there something I'm missing?
> Is
> this a bug or decision to not support MapType with json?
>
> Thanks,
> Brian
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-tp18376.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Any patterns for multiplexing the streaming data

2014-11-07 Thread Tathagata Das
I am not aware of any obvious existing pattern that does exactly this.
Generally, this sort of computation (subsetting, denormalization) involves such
generic-sounding terms but actually has very specific requirements, so it is
hard to point to a design pattern without more requirement info.

If you want to feed back to kafka, you can take a look at this pull request

https://github.com/apache/spark/pull/2994

On Thu, Nov 6, 2014 at 4:15 PM, bdev  wrote:

> We are looking at consuming the kafka stream using Spark Streaming and
> transform into various subsets like applying some transformation or
> de-normalizing some fields, etc. and feed it back into Kafka as a different
> topic for downstream consumers.
>
> Wanted to know if there are any existing patterns for achieving this.
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Any-patterns-for-multiplexing-the-streaming-data-tp18303.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Still struggling with building documentation

2014-11-07 Thread Nicholas Chammas
I believe the web docs need to be built separately according to the
instructions here.

Did you give those a shot?

It's annoying to have a separate thing with new dependencies in order to
build the web docs, but that's how it is at the moment.

Nick

On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta 
wrote:

> I finally came to realize that there is a special maven target to build
> the scaladocs, although arguably a very unintuitive one: mvn verify. So now
> I have scaladocs for each package, but not for the whole spark project.
> Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
> build/docs/api directory referenced in api.html is missing. How do I build
> it?
>
> Alex Baretta
>


jsonRdd and MapType

2014-11-07 Thread boclair
I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)). 
I'd like some of the json fields to be in a MapType rather than a sub
StructType, as the keys will be very sparse.

For example:
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> val jsonRdd = sc.parallelize(Seq("""{"key": "1234", "attributes":
> {"gender": "m"}}""",
   """{"key": "4321",
"attributes": {"location": "nyc"}}"""))
> val schemaRdd = sqlContext.jsonRDD(jsonRdd)
> schemaRdd.printSchema
root
 |-- attributes: struct (nullable = true)
 ||-- gender: string (nullable = true)
 ||-- location: string (nullable = true)
 |-- key: string (nullable = true)
> schemaRdd.collect
res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234],
[[null,nyc],4321])


However this isn't what I want.  So I created my own StructType to pass to
the jsonRDD call:

> import org.apache.spark.sql._
> val st = StructType(Seq(StructField("key", StringType, false),
   StructField("attributes",
MapType(StringType, StringType, false
> val jsonRddSt = sc.parallelize(Seq("""{"key": "1234", "attributes":
> {"gender": "m"}}""",
  """{"key": "4321",
"attributes": {"location": "nyc"}}"""))
> val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st)
> schemaRddSt.printSchema
root
 |-- key: string (nullable = false)
 |-- attributes: map (nullable = true)
 ||-- key: string
 ||-- value: string (valueContainsNull = false)
> schemaRddSt.collect
***  Failure  ***
scala.MatchError: MapType(StringType,StringType,false) (of class
org.apache.spark.sql.catalyst.types.MapType)
at 
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
...

The schema of the schemaRDD is correct.  But it seems that the json cannot
be coerced to a MapType.  I can see at the line in the stack trace that
there is no case statement for MapType.  Is there something I'm missing?  Is
this a bug or decision to not support MapType with json?

Thanks,
Brian




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-tp18376.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Still struggling with building documentation

2014-11-07 Thread Alessandro Baretta
I finally came to realize that there is a special maven target to build the
scaladocs, although arguably a very unintuitive one: mvn verify. So now I
have scaladocs for each package, but not for the whole spark project.
Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
build/docs/api directory referenced in api.html is missing. How do I build
it?

Alex Baretta


Multiple Applications(Spark Contexts) Concurrently Fail With Broadcast Error

2014-11-07 Thread ryaminal
We are unable to run more than one application at a time using Spark 1.0.0 on
CDH5. We submit two applications using two different SparkContexts on the
same Spark Master. The Spark Master was started using the following command
and parameters and is running in standalone mode:

> /usr/java/jdk1.7.0_55-cloudera/bin/java   -XX:MaxPermSize=128m  
> -Djava.net.preferIPv4Stack=true   -Dspark.akka.logLifecycleEvents=true  
> -Xms8589934592   -Xmx8589934592   org.apache.spark.deploy.master.Master
> --ip ip-10-186-155-45.ec2.internal

When submitting this application by itself it finishes and all of the data
comes out happy. The problem occurs when trying to run another application
while an existing application is still processing: we get an error stating
that the spark contexts were shut down prematurely.

The errors can be viewed in the following pastebins. All IP addresses have been
changed to 1.1.1.1 for security reasons. Notice that at the top of the logs we
have printed out the spark config for reference. The working logs: Working
Pastebin. The broken logs: Broken Pastebin.

We have also included the worker logs. For the second app, we see in the
work/app/ directory 7 additional directories: `0/ 1/ 2/ 3/ 4/ 5/ 6/`. There are
then two different groups of errors. The first three are one group and the
other 4 are the other group of errors. Worker log for broken app group 1:
Broken App Group 1. Worker log for broken app group 2: Broken App Group 2.
Worker log for working app: available upon request.

The two different errors are the last lines of both groups and are:

> Received LaunchTask command but executor was null

> Slave registration failed: Duplicate executor ID: 4

tl;dr: We are unable to run more than one application in the same spark master
using different spark contexts. The only errors we see are broadcast errors.


--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Applications-Spark-Contexts-Concurrently-Fail-With-Broadcast-Error-tp18374.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

partitioning to speed up queries

2014-11-07 Thread Gordon Benjamin
Hi All,

I'm using Spark/Shark as the foundation for some reporting that I'm doing
and have a customers table with approximately 3 million rows that I've
cached in memory.

I've also created a partitioned table that I've also cached in memory on a
per day basis

FROM
customers_cached
INSERT OVERWRITE TABLE
part_customers_cached
PARTITION(createday)
SELECT id,email,dt_cr, to_date(dt_cr) as createday where
dt_cr>unix_timestamp('2013-01-01 00:00:00') and
dt_cr select count(*) from part_customers_cached where
 createday >= '2014-08-01' and createday <= '2014-12-06';
37204
Time taken (including network latency): 3.131 seconds

[localhost:1] shark>  SELECT count(*) from customers_cached where
dt_cr>unix_timestamp('2013-08-01 00:00:00') and
dt_cr

Re: Dynamically InferSchema From Hive and Create parquet file

2014-11-07 Thread Michael Armbrust
Perhaps if you can describe what you are trying to accomplish at high level
it'll be easier to help.

On Fri, Nov 7, 2014 at 12:28 AM, Jahagirdar, Madhu <
madhu.jahagir...@philips.com> wrote:

> Any idea on this?
> 
> From: Jahagirdar, Madhu
> Sent: Thursday, November 06, 2014 12:28 PM
> To: Michael Armbrust
> Cc: u...@spark.incubator.apache.org
> Subject: RE: Dynamically InferSchema From Hive and Create parquet file
>
> When I create Hive table with Parquet format, it does not create any
> metadata until data in inserted. So data needs to be there before I infer
> the schema otherwise it throws error. Any workaround for this ?
> 
> From: Michael Armbrust [mich...@databricks.com]
> Sent: Thursday, November 06, 2014 12:27 AM
> To: Jahagirdar, Madhu
> Cc: u...@spark.incubator.apache.org
> Subject: Re: Dynamically InferSchema From Hive and Create parquet file
>
> That method is for creating a new directory to hold parquet data when
> there is no hive metastore available, thus you have to specify the schema.
>
> If you've already created the table in the metastore you can just query it
> using the sql method:
>
> javahiveContext.sql("SELECT * FROM parquetTable");
>
> You can also load the data as a SchemaRDD without using the metastore
> since parquet is self describing:
>
>
> javahiveContext.parquetFile(".../path/to/parquetFiles").registerTempTable("parquetData")
>
> On Wed, Nov 5, 2014 at 2:15 AM, Jahagirdar, Madhu <
> madhu.jahagir...@philips.com> wrote:
> Currently the createParquetFile method needs BeanClass as one of the parameters.
>
> javahiveContext.createParquetFile(XBean.class,
>     IMPALA_TABLE_LOC, true, new Configuration())
>   .registerTempTable(TEMP_TABLE_NAME);
>
>
> Is it possible that we dynamically Infer Schema From Hive using hive
> context and the table name, then give that Schema ?
>
>
> Regards.
>
> Madhu Jahagirdar
>
>
>
>
>
>
> 
>
>


spark-submit inside script... need some bash help

2014-11-07 Thread Koert Kuipers
i need to run spark-submit inside a script with options that are built up
programmatically. oh and i need to use exec to keep the same pid (so it can
run as a service and be killed).

this is what i tried:
==
#!/bin/bash -e

SPARK_SUBMIT=/usr/local/lib/spark/bin/spark-submit

OPTS="--class org.apache.spark.examples.SparkPi"
OPTS+=" --driver-java-options \"-Da=b -Dc=d\""

echo $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar

exec $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar
==

no luck. it gets confused on the multiple java options it seems. i get:
Exception in thread "main" java.lang.NoClassDefFoundError: "-Da=b
Caused by: java.lang.ClassNotFoundException: "-Da=b
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: "-Da=b.  Program will exit.

i also tried many other ways of escaping the quoted java options. none of
them work.
strangely it does work if i replace the last line with the following (there is no
science to this for me, i don't know much about bash, just trying random and
probably bad things):
eval exec $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar

i am lost as to why... and there must be a better solution? it looks kinda
nasty with the eval + exec

best, koert


Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Simon Chan
Just want to elaborate more on Duy's suggestion on using PredictionIO.

PredictionIO will store the model automatically if you return it in the
training function.
An example using CF:

 def train(data: PreparedData): PersistentMatrixFactorizationModel = {
   val m = ALS.train(data.ratings, ap.rank, ap.numIterations, ap.lambda)
   new PersistentMatrixFactorizationModel(
     rank = m.rank,
     userFeatures = m.userFeatures,
     productFeatures = m.productFeatures)
 }


And the persisted model will be passed to the predict function when you
query for prediction:

def predict(
    model: PersistentMatrixFactorizationModel,
    query: Query): PredictedResult = {
  val productScores = model.recommendProducts(query.user, query.num)
    .map(r => ProductScore(r.product, r.rating))
  new PredictedResult(productScores)
}



Some templates and tutorials for MLlib are here:
http://docs.prediction.io/0.8.1/templates/

Simon


On Fri, Nov 7, 2014 at 10:11 AM, Nick Pentreath 
wrote:

> Sure - in theory this sounds great. But in practice it's much faster and a
> whole lot simpler to just serve the model from single instance in memory.
> Optionally you can multithread within that (as Oryx 1 does).
>
> There are very few real world use cases where the model is so large that
> it HAS to be distributed.
>
> Having said this, it's certainly possible to distribute model serving for
> factor-like models (like ALS). One idea I'm working on now is using
> Elasticsearch for exactly this purpose - but that more because I'm using it
> for filtering of recommendation results and combining with search, so
> overall it's faster to do it this way.
>
> For the pure matrix algebra part, single instance in memory is way faster.
>
> —
> Sent from Mailbox 
>
>
> On Fri, Nov 7, 2014 at 8:00 PM, Duy Huynh  wrote:
>
>> hi nick.. sorry about the confusion.  originally i had a question
>> specifically about word2vec, but my follow up question on distributed model
>> is a more general question about saving different types of models.
>>
>> on distributed model, i was hoping to implement a model parallelism, so
>> that different workers can work on different parts of the models, and then
>> merge the results at the end at the single master model.
>>
>>
>>
>> On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath > > wrote:
>>
>>> Currently I see the word2vec model is collected onto the master, so the
>>> model itself is not distributed.
>>>
>>> I guess the question is why do you need  a distributed model? Is the
>>> vocab size so large that it's necessary? For model serving in general,
>>> unless the model is truly massive (ie cannot fit into memory on a modern
>>> high end box with 64, or 128GB ram) then single instance is way faster and
>>> simpler (using a cluster of machines is more for load balancing / fault
>>> tolerance).
>>>
>>> What is your use case for model serving?
>>>
>>> —
>>> Sent from Mailbox 
>>>
>>>
>>> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh 
>>> wrote:
>>>
 you're right, serialization works.

 what is your suggestion on saving a "distributed" model?  so part of
 the model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
 wrote:

> There's some work going on to support PMML -
> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
> been merged into master.
>
> What are you used to doing in other environments? In R I'm used to
> running save(), same with matlab. In python either pickling things or
> dumping to json seems pretty common. (even the scikit-learn docs recommend
> pickling -
> http://scikit-learn.org/stable/modules/model_persistence.html). These
> all seem basically equivalent java serialization to me..
>
> Would some helper functions (in, say, mllib.util.modelpersistence or
> something) make sense to add?
>
> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
> wrote:
>
>> that works.  is there a better way in spark?  this seems like the
>> most common feature for any machine learning work - to be able to save 
>> your
>> model after training it and load it later.
>>
>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks > > wrote:
>>
>>> Plain old java serialization is one straightforward approach if
>>> you're in java/scala.
>>>
>>> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>>>
 what is the best way to save an mllib model that you just trained
 and reload
 it in the future?  specifically, i'm using the mll

Re: sparse x sparse matrix multiplication

2014-11-07 Thread Reza Zadeh
If you have a very large and very sparse matrix represented as (i, j,
value) entries, then you can try the algorithms mentioned in the post
brought up earlier.
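A hedged sketch of the block approach Xiangrui describes further down in the
quoted thread (Breeze for the local sparse-sparse block products, join plus
reduceByKey in place of aggregateByKey for the distributed part; block ids are
assumed to have been assigned by the caller):

import breeze.linalg.CSCMatrix
import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark < 1.3)
import org.apache.spark.rdd.RDD

def multiplyBlocked(
    a: RDD[(Int, Int, CSCMatrix[Double])],    // (blockRowId, blockColId, sub-matrix)
    b: RDD[(Int, Int, CSCMatrix[Double])]): RDD[(Int, Int, CSCMatrix[Double])] = {
  val aByCol = a.map { case (i, k, m) => (k, (i, m)) }
  val bByRow = b.map { case (k, j, m) => (k, (j, m)) }
  aByCol.join(bByRow)                                          // pair blocks sharing k
    .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) } // local sparse product
    .reduceByKey(_ + _)                                        // sum partial products
    .map { case ((i, j), m) => (i, j, m) }
}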

Reza

On Fri, Nov 7, 2014 at 8:31 AM, Duy Huynh  wrote:

> thanks reza.  i'm not familiar with the "block matrix multiplication", but
> is it a good fit for "very large dimension, but extremely sparse" matrix?
>
> if not, what is your recommendation on implementing matrix multiplication
> in spark on "very large dimension, but extremely sparse" matrix?
>
>
>
>
> On Thu, Nov 6, 2014 at 5:50 PM, Reza Zadeh  wrote:
>
>> See this thread for examples of sparse matrix x sparse matrix:
>> https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA
>>
>> We thought about providing matrix multiplies on CoordinateMatrix,
>> however, the matrices have to be very dense for the overhead of having many
>> little (i, j, value) objects to be worth it. For this reason, we are
>> focused on doing block matrix multiplication first. The goal is version 1.3.
>>
>> Best,
>> Reza
>>
>> On Wed, Nov 5, 2014 at 11:48 PM, Wei Tan  wrote:
>>
>>> I think Xiangrui's ALS code implements certain aspects of it. You may want
>>> to check it out.
>>> Best regards,
>>> Wei
>>>
>>> -
>>> Wei Tan, PhD
>>> Research Staff Member
>>> IBM T. J. Watson Research Center
>>>
>>>
>>>
>>> From: Xiangrui Meng 
>>> To: Duy Huynh 
>>> Cc: user 
>>> Date: 11/05/2014 01:13 PM
>>> Subject: Re: sparse x sparse matrix multiplication
>>> --
>>>
>>>
>>>
>>> You can use breeze for local sparse-sparse matrix multiplication and
>>> then define an RDD of sub-matrices
>>>
>>> RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix)
>>>
>>> and then use join and aggregateByKey to implement this feature, which
>>> is the same as in MapReduce.
>>>
>>> -Xiangrui
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>>
>>
>


Re: AVRO specific records

2014-11-07 Thread Simone Franzini
Ok, that turned out to be a dependency issue with Hadoop1 vs. Hadoop2 that
I have not fully solved yet. I am able to run with Hadoop1 and AVRO in
standalone mode but not with Hadoop2 (even after trying to fix the
dependencies).

Anyway, I am now trying to write to AVRO, using a very similar snippet to
the one to read from AVRO:

val withValues: RDD[(AvroKey[Subscriber], NullWritable)] =
  records.map { s => (new AvroKey(s), NullWritable.get) }
val outPath = "myOutputPath"
val writeJob = new Job()
FileOutputFormat.setOutputPath(writeJob, new Path(outPath))
AvroJob.setOutputKeySchema(writeJob, Subscriber.getClassSchema())
writeJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Any]])
records.saveAsNewAPIHadoopFile(outPath,
  classOf[AvroKey[Subscriber]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[Subscriber]],
  writeJob.getConfiguration)

Now, my problem is that this writes to a plain text file. I need to write
to binary AVRO. What am I missing?

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Thu, Nov 6, 2014 at 3:15 PM, Simone Franzini 
wrote:

> Benjamin,
>
> Thanks for the snippet. I have tried using it, but unfortunately I get the
> following exception. I am clueless at what might be wrong. Any ideas?
>
> java.lang.IncompatibleClassChangeError: Found interface
> org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
> at
> org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
> On Wed, Nov 5, 2014 at 4:24 PM, Laird, Benjamin <
> benjamin.la...@capitalone.com> wrote:
>
>> Something like this works and is how I create an RDD of specific records.
>>
>> val avroRdd = sc.newAPIHadoopFile("twitter.avro",
>> classOf[AvroKeyInputFormat[twitter_schema]],
>> classOf[AvroKey[twitter_schema]], classOf[NullWritable], conf) (From
>> https://github.com/julianpeeters/avro-scala-macro-annotation-examples/blob/master/spark/src/main/scala/AvroSparkScala.scala)
>> Keep in mind you'll need to use the kryo serializer as well.
>>
>> From: Frank Austin Nothaft 
>> Date: Wednesday, November 5, 2014 at 5:06 PM
>> To: Simone Franzini 
>> Cc: "user@spark.apache.org" 
>> Subject: Re: AVRO specific records
>>
>> Hi Simone,
>>
>> Matt Massie put together a good tutorial on his blog
>> . If you’re
>> looking for more code using Avro, we use it pretty extensively in our
>> genomics project. Our Avro schemas are here
>> ,
>> and we have serialization code here
>> .
>> We use Parquet for storing the Avro records, but there is also an Avro
>> HadoopInputFormat.
>>
>> Regards,
>>
>> Frank Austin Nothaft
>> fnoth...@berkeley.edu
>> fnoth...@eecs.berkeley.edu
>> 202-340-0466
>>
>> On Nov 5, 2014, at 1:25 PM, Simone Franzini 
>> wrote:
>>
>> How can I read/write AVRO specific records?
>> I found several snippets using generic records, but nothing with specific
>> records so far.
>>
>> Thanks,
>> Simone Franzini, PhD
>>
>> http://www.linkedin.com/in/simonefranzini
>>
>>
>>
>> --
>>
>> The information contained in this e-mail is confidential and/or
>> proprietary to Capital One and/or its affiliates. The information
>> transmitted herewith is intended only for use by the individual or entity
>> to which it is addressed.  If the reader of this message is not the
>> intended recipient, you are hereby notified that any review,
>> retransmission, dissemination, distribution, copying or other use of, or
>> taking of any action in reliance upon this information is strictly
>> prohibited. If you have received this communication in error, please
>> contact the sender and delete the material from your computer.
>>
>
>


Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
Sure - in theory this sounds great. But in practice it's much faster and a 
whole lot simpler to just serve the model from a single instance in memory. 
Optionally you can multithread within that (as Oryx 1 does).


There are very few real world use cases where the model is so large that it HAS 
to be distributed.




Having said this, it's certainly possible to distribute model serving for 
factor-like models (like ALS). One idea I'm working on now is using 
Elasticsearch for exactly this purpose - but that more because I'm using it for 
filtering of recommendation results and combining with search, so overall it's 
faster to do it this way.




For the pure matrix algebra part, single instance in memory is way faster. 


—
Sent from Mailbox

On Fri, Nov 7, 2014 at 8:00 PM, Duy Huynh  wrote:

> hi nick.. sorry about the confusion.  originally i had a question
> specifically about word2vec, but my follow up question on distributed model
> is a more general question about saving different types of models.
> on distributed model, i was hoping to implement a model parallelism, so
> that different workers can work on different parts of the models, and then
> merge the results at the end at the single master model.
> On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath 
> wrote:
>> Currently I see the word2vec model is collected onto the master, so the
>> model itself is not distributed.
>>
>> I guess the question is why do you need  a distributed model? Is the vocab
>> size so large that it's necessary? For model serving in general, unless the
>> model is truly massive (ie cannot fit into memory on a modern high end box
>> with 64, or 128GB ram) then single instance is way faster and simpler
>> (using a cluster of machines is more for load balancing / fault tolerance).
>>
>> What is your use case for model serving?
>>
>> —
>> Sent from Mailbox 
>>
>>
>> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh  wrote:
>>
>>> you're right, serialization works.
>>>
>>> what is your suggestion on saving a "distributed" model?  so part of the
>>> model is in one cluster, and some other parts of the model are in other
>>> clusters.  during runtime, these sub-models run independently in their own
>>> clusters (load, train, save).  and at some point during run time these
>>> sub-models merge into the master model, which also loads, trains, and saves
>>> at the master level.
>>>
>>> much appreciated.
>>>
>>>
>>>
>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
>>> wrote:
>>>
 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
 been merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
 These all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
 wrote:

> that works.  is there a better way in spark?  this seems like the most
> common feature for any machine learning work - to be able to save your
> model after training it and load it later.
>
> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
> wrote:
>
>> Plain old java serialization is one straightforward approach if you're
>> in java/scala.
>>
>> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>>
>>> what is the best way to save an mllib model that you just trained and
>>> reload
>>> it in the future?  specifically, i'm using the mllib word2vec model...
>>> thanks.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>>> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>

>>>
>>

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
thanks nick.  i'll take a look at oryx and prediction.io.

re: private val model in word2vec ;) yes, i couldn't wait so i just changed
it in the word2vec source code.  but i'm running into some compilation
issues now.  hopefully i can fix them soon to get things going.

On Fri, Nov 7, 2014 at 12:52 PM, Nick Pentreath 
wrote:

> For ALS if you want real time recs (and usually this is order 10s to a few
> 100s ms response), then Spark is not the way to go - a serving layer like
> Oryx, or prediction.io is what you want.
>
> (At graphflow we've built our own).
>
> You hold the factor matrices in memory and do the dot product in real time
> (with optional caching). Again, even for huge models (10s of millions
> users/items) this can be handled on a single, powerful instance. The issue
> at this scale is winnowing down the search space using LSH or similar
> approach to get to real time speeds.
>
> For word2vec it's pretty much the same thing as what you have is very
> similar to one of the ALS factor matrices.
>
> One problem is you can't access the word2vec vectors as they are private
> val. I think this should be changed actually, so that just the word vectors
> could be saved and used in a serving layer.
>
> —
> Sent from Mailbox 
>
>
> On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks 
> wrote:
>
>> There are a few examples where this is the case. Let's take ALS, where
>> the result is a MatrixFactorizationModel, which is assumed to be big - the
>> model consists of two matrices, one (users x k) and one (k x products).
>> These are represented as RDDs.
>>
>> You can save these RDDs out to disk by doing something like
>>
>> model.userFeatures.saveAsObjectFile(...) and
>> model.productFeatures.saveAsObjectFile(...)
>>
>> to save out to HDFS or Tachyon or S3.
>>
>> Then, when you want to reload you'd have to instantiate them into a class
>> of MatrixFactorizationModel. That class is package private to MLlib right
>> now, so you'd need to copy the logic over to a new class, but that's the
>> basic idea.
>>
>> That said - using spark to serve these recommendations on a
>> point-by-point basis might not be optimal. There's some work going on in
>> the AMPLab to address this issue.
>>
>> On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh 
>> wrote:
>>
>>> you're right, serialization works.
>>>
>>> what is your suggestion on saving a "distributed" model?  so part of the
>>> model is in one cluster, and some other parts of the model are in other
>>> clusters.  during runtime, these sub-models run independently in their own
>>> clusters (load, train, save).  and at some point during run time these
>>> sub-models merge into the master model, which also loads, trains, and saves
>>> at the master level.
>>>
>>> much appreciated.
>>>
>>>
>>>
>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
>>> wrote:
>>>
 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
 been merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling -
 http://scikit-learn.org/stable/modules/model_persistence.html). These
 all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
 wrote:

> that works.  is there a better way in spark?  this seems like the most
> common feature for any machine learning work - to be able to save your
> model after training it and load it later.
>
> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
> wrote:
>
>> Plain old java serialization is one straightforward approach if
>> you're in java/scala.
>>
>> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>>
>>> what is the best way to save an mllib model that you just trained
>>> and reload
>>> it in the future?  specifically, i'm using the mllib word2vec
>>> model...
>>> thanks.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>>> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>

>>>
>>
>


Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
yep, but that's only if they are already represented as RDDs, which makes
saving and loading much more convenient.

my question is for the use case where they are not represented as RDDs yet.

then, do you think it makes sense to convert them into RDDs, just for the
convenience of saving and loading them in a distributed way?

On Fri, Nov 7, 2014 at 12:36 PM, Evan R. Sparks 
wrote:

> There are a few examples where this is the case. Let's take ALS, where the
> result is a MatrixFactorizationModel, which is assumed to be big - the
> model consists of two matrices, one (users x k) and one (k x products).
> These are represented as RDDs.
>
> You can save these RDDs out to disk by doing something like
>
> model.userFeatures.saveAsObjectFile(...) and
> model.productFeatures.saveAsObjectFile(...)
>
> to save out to HDFS or Tachyon or S3.
>
> Then, when you want to reload you'd have to instantiate them into a class
> of MatrixFactorizationModel. That class is package private to MLlib right
> now, so you'd need to copy the logic over to a new class, but that's the
> basic idea.
>
> That said - using spark to serve these recommendations on a point-by-point
> basis might not be optimal. There's some work going on in the AMPLab to
> address this issue.
>
> On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh  wrote:
>
>> you're right, serialization works.
>>
>> what is your suggestion on saving a "distributed" model?  so part of the
>> model is in one cluster, and some other parts of the model are in other
>> clusters.  during runtime, these sub-models run independently in their own
>> clusters (load, train, save).  and at some point during run time these
>> sub-models merge into the master model, which also loads, trains, and saves
>> at the master level.
>>
>> much appreciated.
>>
>>
>>
>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
>> wrote:
>>
>>> There's some work going on to support PMML -
>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>> been merged into master.
>>>
>>> What are you used to doing in other environments? In R I'm used to
>>> running save(), same with matlab. In python either pickling things or
>>> dumping to json seems pretty common. (even the scikit-learn docs recommend
>>> pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
>>> These all seem basically equivalent java serialization to me..
>>>
>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>> something) make sense to add?
>>>
>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
>>> wrote:
>>>
 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
 wrote:

> Plain old java serialization is one straightforward approach if you're
> in java/scala.
>
> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>
>> what is the best way to save an mllib model that you just trained and
>> reload
>> it in the future?  specifically, i'm using the mllib word2vec model...
>> thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>
>


Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
hi nick.. sorry about the confusion.  originally i had a question
specifically about word2vec, but my follow up question on distributed model
is a more general question about saving different types of models.

on the distributed model, i was hoping to implement model parallelism, so
that different workers can work on different parts of the model, and then
merge the results at the end into a single master model.



On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath 
wrote:

> Currently I see the word2vec model is collected onto the master, so the
> model itself is not distributed.
>
> I guess the question is why do you need  a distributed model? Is the vocab
> size so large that it's necessary? For model serving in general, unless the
> model is truly massive (ie cannot fit into memory on a modern high end box
> with 64, or 128GB ram) then single instance is way faster and simpler
> (using a cluster of machines is more for load balancing / fault tolerance).
>
> What is your use case for model serving?
>
> —
> Sent from Mailbox 
>
>
> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh  wrote:
>
>> you're right, serialization works.
>>
>> what is your suggestion on saving a "distributed" model?  so part of the
>> model is in one cluster, and some other parts of the model are in other
>> clusters.  during runtime, these sub-models run independently in their own
>> clusters (load, train, save).  and at some point during run time these
>> sub-models merge into the master model, which also loads, trains, and saves
>> at the master level.
>>
>> much appreciated.
>>
>>
>>
>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
>> wrote:
>>
>>> There's some work going on to support PMML -
>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>> been merged into master.
>>>
>>> What are you used to doing in other environments? In R I'm used to
>>> running save(), same with matlab. In python either pickling things or
>>> dumping to json seems pretty common. (even the scikit-learn docs recommend
>>> pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
>>> These all seem basically equivalent java serialization to me..
>>>
>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>> something) make sense to add?
>>>
>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
>>> wrote:
>>>
 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
 wrote:

> Plain old java serialization is one straightforward approach if you're
> in java/scala.
>
> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>
>> what is the best way to save an mllib model that you just trained and
>> reload
>> it in the future?  specifically, i'm using the mllib word2vec model...
>> thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>
>


Re: deploying a model built in mllib

2014-11-07 Thread chirag lakhani
Thanks for letting me know about this, it looks pretty interesting.  From
reading the documentation it seems that the server must be built on a Spark
cluster, is that correct?  Is it possible to deploy it on a Java
server?  That is how we are currently running our web app.



On Tue, Nov 4, 2014 at 7:57 PM, Simon Chan  wrote:

> The latest version of PredictionIO, which is now under Apache 2 license,
> supports the deployment of MLlib models on production.
>
> The "engine" you build will including a few components, such as:
> - Data - includes Data Source and Data Preparator
> - Algorithm(s)
> - Serving
> I believe that you can do the feature vector creation inside the Data
> Preparator component.
>
> Currently, the package comes with two templates: 1)  Collaborative
> Filtering Engine Template - with MLlib ALS; 2) Classification Engine
> Template - with MLlib Naive Bayes. The latter one may be useful to you. And
> you can customize the Algorithm component, too.
>
> I have just created a doc: http://docs.prediction.io/0.8.1/templates/
> Love to hear your feedback!
>
> Regards,
> Simon
>
>
>
> On Mon, Oct 27, 2014 at 11:03 AM, chirag lakhani  > wrote:
>
>> Would pipelining include model export?  I didn't see that in the
>> documentation.
>>
>> Are there ways that this is being done currently?
>>
>>
>>
>> On Mon, Oct 27, 2014 at 12:39 PM, Xiangrui Meng  wrote:
>>
>>> We are working on the pipeline features, which would make this
>>> procedure much easier in MLlib. This is still a WIP and the main JIRA
>>> is at:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-1856
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani
>>>  wrote:
>>> > Hello,
>>> >
>>> > I have been prototyping a text classification model that my company
>>> would
>>> > like to eventually put into production.  Our technology stack is
>>> currently
>>> > Java based but we would like to be able to build our models in
>>> Spark/MLlib
>>> > and then export something like a PMML file which can be used for model
>>> > scoring in real-time.
>>> >
>>> > I have been using scikit learn where I am able to take the training
>>> data
>>> > convert the text data into a sparse data format and then take the other
>>> > features and use the dictionary vectorizer to do one-hot encoding for
>>> the
>>> > other categorical variables.  All of those things seem to be possible
>>> in
>>> > mllib but I am still puzzled about how that can be packaged in such a
>>> way
>>> > that the incoming data can be first made into feature vectors and then
>>> > evaluated as well.
>>> >
>>> > Are there any best practices for this type of thing in Spark?  I hope
>>> this
>>> > is clear but if there are any confusions then please let me know.
>>> >
>>> > Thanks,
>>> >
>>> > Chirag
>>>
>>
>>
>


Re: where is the org.apache.spark.util package?

2014-11-07 Thread ll
i found the util package under the spark core package, but now i get this error:
"Symbol Utils is inaccessible from this place".

what does this error mean?

the org.apache.spark.util package and org.apache.spark.util.Utils are there now.

thanks.
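For what it's worth: in the Spark sources that object is declared private[spark], which is what the "inaccessible from this place" error is telling you. A fragile, untested workaround sketch is to compile your helper inside the same package tree (UtilsBridge is a made-up name; localHostName is just one of Utils' methods, used here only for illustration). Relying on internal APIs like Utils may break between Spark versions.

package org.apache.spark.util

// Code compiled inside the org.apache.spark package tree can see
// private[spark] members such as Utils.
object UtilsBridge {
  def localHostName(): String = Utils.localHostName()
}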



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/where-is-the-org-apache-spark-util-package-tp18360p18361.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
For ALS if you want real time recs (and usually this is order 10s to a few 100s 
ms response), then Spark is not the way to go - a serving layer like Oryx, or 
prediction.io is what you want.


(At graphflow we've built our own).




You hold the factor matrices in memory and do the dot product in real time 
(with optional caching). Again, even for huge models (10s of millions 
users/items) this can be handled on a single, powerful instance. The issue at 
this scale is winnowing down the search space using LSH or similar approach to 
get to real time speeds.




For word2vec it's pretty much the same thing as what you have is very similar 
to one of the ALS factor matrices.




One problem is you can't access the word2vec vectors as they are a private val. I 
think this should be changed actually, so that just the word vectors could be 
saved and used in a serving layer.


—
Sent from Mailbox
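For concreteness, a rough untested sketch of this single-instance serving pattern, assuming the ALS factors were previously written with saveAsObjectFile; the path and the loadFactors/recommend helpers are illustrative names, not an existing API:

import org.apache.spark.SparkContext

// Pull a factor matrix into driver/server memory as a plain Map.
def loadFactors(sc: SparkContext, path: String): Map[Int, Array[Double]] =
  sc.objectFile[(Int, Array[Double])](path).collect().toMap

def dot(a: Array[Double], b: Array[Double]): Double = {
  var i = 0; var s = 0.0
  while (i < a.length) { s += a(i) * b(i); i += 1 }
  s
}

// Score all items for one user entirely in memory; no Spark job per request.
def recommend(user: Int,
              users: Map[Int, Array[Double]],
              items: Map[Int, Array[Double]],
              k: Int): Seq[(Int, Double)] =
  items.toSeq
    .map { case (itemId, v) => (itemId, dot(users(user), v)) }
    .sortBy(-_._2)
    .take(k)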

On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks 
wrote:

> There are a few examples where this is the case. Let's take ALS, where the
> result is a MatrixFactorizationModel, which is assumed to be big - the
> model consists of two matrices, one (users x k) and one (k x products).
> These are represented as RDDs.
> You can save these RDDs out to disk by doing something like
> model.userFeatures.saveAsObjectFile(...) and
> model.productFeatures.saveAsObjectFile(...)
> to save out to HDFS or Tachyon or S3.
> Then, when you want to reload you'd have to instantiate them into a class
> of MatrixFactorizationModel. That class is package private to MLlib right
> now, so you'd need to copy the logic over to a new class, but that's the
> basic idea.
> That said - using spark to serve these recommendations on a point-by-point
> basis might not be optimal. There's some work going on in the AMPLab to
> address this issue.
> On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh  wrote:
>> you're right, serialization works.
>>
>> what is your suggestion on saving a "distributed" model?  so part of the
>> model is in one cluster, and some other parts of the model are in other
>> clusters.  during runtime, these sub-models run independently in their own
>> clusters (load, train, save).  and at some point during run time these
>> sub-models merge into the master model, which also loads, trains, and saves
>> at the master level.
>>
>> much appreciated.
>>
>>
>>
>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
>> wrote:
>>
>>> There's some work going on to support PMML -
>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
>>> merged into master.
>>>
>>> What are you used to doing in other environments? In R I'm used to
>>> running save(), same with matlab. In python either pickling things or
>>> dumping to json seems pretty common. (even the scikit-learn docs recommend
>>> pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
>>> These all seem basically equivalent java serialization to me..
>>>
>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>> something) make sense to add?
>>>
>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
>>> wrote:
>>>
 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
 wrote:

> Plain old java serialization is one straightforward approach if you're
> in java/scala.
>
> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>
>> what is the best way to save an mllib model that you just trained and
>> reload
>> it in the future?  specifically, i'm using the mllib word2vec model...
>> thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>

where is the org.apache.spark.util package?

2014-11-07 Thread ll
i'm trying to compile some of the spark code directly from the source
(https://github.com/apache/spark).  it complains about the missing package
org.apache.spark.util.  it doesn't look like this package is part of the
source code on github. 

where can i find this package?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/where-is-the-org-apache-spark-util-package-tp18360.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Evan R. Sparks
There are a few examples where this is the case. Let's take ALS, where the
result is a MatrixFactorizationModel, which is assumed to be big - the
model consists of two matrices, one (users x k) and one (k x products).
These are represented as RDDs.

You can save these RDDs out to disk by doing something like

model.userFeatures.saveAsObjectFile(...) and
model.productFeatures.saveAsObjectFile(...)

to save out to HDFS or Tachyon or S3.

Then, when you want to reload you'd have to instantiate them into a class
of MatrixFactorizationModel. That class is package private to MLlib right
now, so you'd need to copy the logic over to a new class, but that's the
basic idea.

That said - using spark to serve these recommendations on a point-by-point
basis might not be optimal. There's some work going on in the AMPLab to
address this issue.
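To make that concrete, here is a minimal untested sketch against the Spark 1.1-era API; the paths, the hard-coded ALS parameters, and the helper names are made up for illustration:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// Training side: persist the two factor RDDs.
def trainAndSave(ratings: RDD[Rating], path: String): Unit = {
  val model = ALS.train(ratings, 10, 10, 0.01)   // rank, iterations, lambda
  model.userFeatures.saveAsObjectFile(path + "/userFeatures")
  model.productFeatures.saveAsObjectFile(path + "/productFeatures")
}

// Reload side: the factors come back as plain RDDs; rebuilding a full
// MatrixFactorizationModel means copying its (package-private) logic,
// so this just returns the raw factor RDDs.
def loadFactors(sc: SparkContext, path: String)
    : (RDD[(Int, Array[Double])], RDD[(Int, Array[Double])]) =
  (sc.objectFile[(Int, Array[Double])](path + "/userFeatures"),
   sc.objectFile[(Int, Array[Double])](path + "/productFeatures"))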

On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh  wrote:

> you're right, serialization works.
>
> what is your suggestion on saving a "distributed" model?  so part of the
> model is in one cluster, and some other parts of the model are in other
> clusters.  during runtime, these sub-models run independently in their own
> clusters (load, train, save).  and at some point during run time these
> sub-models merge into the master model, which also loads, trains, and saves
> at the master level.
>
> much appreciated.
>
>
>
> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
> wrote:
>
>> There's some work going on to support PMML -
>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
>> merged into master.
>>
>> What are you used to doing in other environments? In R I'm used to
>> running save(), same with matlab. In python either pickling things or
>> dumping to json seems pretty common. (even the scikit-learn docs recommend
>> pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
>> These all seem basically equivalent java serialization to me..
>>
>> Would some helper functions (in, say, mllib.util.modelpersistence or
>> something) make sense to add?
>>
>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
>> wrote:
>>
>>> that works.  is there a better way in spark?  this seems like the most
>>> common feature for any machine learning work - to be able to save your
>>> model after training it and load it later.
>>>
>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
>>> wrote:
>>>
 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:

> what is the best way to save an mllib model that you just trained and
> reload
> it in the future?  specifically, i'm using the mllib word2vec model...
> thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>


Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
Currently I see the word2vec model is collected onto the master, so the model 
itself is not distributed. 


I guess the question is why do you need  a distributed model? Is the vocab size 
so large that it's necessary? For model serving in general, unless the model is 
truly massive (ie cannot fit into memory on a modern high end box with 64, or 
128GB ram) then single instance is way faster and simpler (using a cluster of 
machines is more for load balancing / fault tolerance).




What is your use case for model serving?


—
Sent from Mailbox

On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh  wrote:

> you're right, serialization works.
> what is your suggestion on saving a "distributed" model?  so part of the
> model is in one cluster, and some other parts of the model are in other
> clusters.  during runtime, these sub-models run independently in their own
> clusters (load, train, save).  and at some point during run time these
> sub-models merge into the master model, which also loads, trains, and saves
> at the master level.
> much appreciated.
> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
> wrote:
>> There's some work going on to support PMML -
>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
>> merged into master.
>>
>> What are you used to doing in other environments? In R I'm used to running
>> save(), same with matlab. In python either pickling things or dumping to
>> json seems pretty common. (even the scikit-learn docs recommend pickling -
>> http://scikit-learn.org/stable/modules/model_persistence.html). These all
>> seem basically equivalent java serialization to me..
>>
>> Would some helper functions (in, say, mllib.util.modelpersistence or
>> something) make sense to add?
>>
>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
>> wrote:
>>
>>> that works.  is there a better way in spark?  this seems like the most
>>> common feature for any machine learning work - to be able to save your
>>> model after training it and load it later.
>>>
>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
>>> wrote:
>>>
 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:

> what is the best way to save an mllib model that you just trained and
> reload
> it in the future?  specifically, i'm using the mllib word2vec model...
> thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>

Re: sparse x sparse matrix multiplication

2014-11-07 Thread Duy Huynh
thanks reza.  i'm not familiar with the "block matrix multiplication", but
is it a good fit for "very large dimension, but extremely sparse" matrix?

if not, what is your recommendation on implementing matrix multiplication
in spark on "very large dimension, but extremely sparse" matrix?




On Thu, Nov 6, 2014 at 5:50 PM, Reza Zadeh  wrote:

> See this thread for examples of sparse matrix x sparse matrix:
> https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA
>
> We thought about providing matrix multiplies on CoordinateMatrix, however,
> the matrices have to be very dense for the overhead of having many little
> (i, j, value) objects to be worth it. For this reason, we are focused on
> doing block matrix multiplication first. The goal is version 1.3.
>
> Best,
> Reza
>
> On Wed, Nov 5, 2014 at 11:48 PM, Wei Tan  wrote:
>
>> I think Xiangrui's ALS code implement certain aspect of it. You may want
>> to check it out.
>> Best regards,
>> Wei
>>
>> -
>> Wei Tan, PhD
>> Research Staff Member
>> IBM T. J. Watson Research Center
>>
>>
>> [image: Inactive hide details for Xiangrui Meng ---11/05/2014 01:13:40
>> PM---You can use breeze for local sparse-sparse matrix multiplic]Xiangrui
>> Meng ---11/05/2014 01:13:40 PM---You can use breeze for local sparse-sparse
>> matrix multiplication and then define an RDD of sub-matri
>>
>> From: Xiangrui Meng 
>> To: Duy Huynh 
>> Cc: user 
>> Date: 11/05/2014 01:13 PM
>> Subject: Re: sparse x sparse matrix multiplication
>> --
>>
>>
>>
>> You can use breeze for local sparse-sparse matrix multiplication and
>> then define an RDD of sub-matrices
>>
>> RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix)
>>
>> and then use join and aggregateByKey to implement this feature, which
>> is the same as in MapReduce.
>>
>> -Xiangrui
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>
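To make the block approach described above concrete, here is a rough, untested sketch that uses Breeze for the local sparse products; how the matrices are cut into blocks (and padded so block shapes line up) is assumed to be handled elsewhere:

import breeze.linalg.CSCMatrix
import org.apache.spark.SparkContext._   // pair-RDD ops (join, reduceByKey) pre-1.3
import org.apache.spark.rdd.RDD

// A and B are block matrices: (blockRow, blockCol, local sparse block).
// C = A * B: join A's block-column index with B's block-row index, multiply
// the local blocks with Breeze, then sum blocks landing on the same (i, j).
def multiplyBlocks(a: RDD[(Int, Int, CSCMatrix[Double])],
                   b: RDD[(Int, Int, CSCMatrix[Double])])
    : RDD[((Int, Int), CSCMatrix[Double])] = {
  val aByCol = a.map { case (i, k, m) => (k, (i, m)) }
  val bByRow = b.map { case (k, j, m) => (k, (j, m)) }
  aByCol.join(bByRow)
    .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) }
    .reduceByKey(_ + _)
}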


error when importing HiveContext

2014-11-07 Thread Pagliari, Roberto
I'm getting this error when importing HiveContext:

>>> from pyspark.sql import HiveContext
Traceback (most recent call last):
  File "", line 1, in 
  File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in 
from pyspark.context import SparkContext
  File "/path/spark-1.1.0/python/pyspark/context.py", line 30, in 
from pyspark.java_gateway import launch_gateway
  File "/path/spark-1.1.0/python/pyspark/java_gateway.py", line 26, in 
from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway

I cannot find py4j on my system. Where is it?


Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
you're right, serialization works.

what is your suggestion on saving a "distributed" model?  so part of the
model is in one cluster, and some other parts of the model are in other
clusters.  during runtime, these sub-models run independently in their own
clusters (load, train, save).  and at some point during run time these
sub-models merge into the master model, which also loads, trains, and saves
at the master level.

much appreciated.



On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
wrote:

> There's some work going on to support PMML -
> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
> merged into master.
>
> What are you used to doing in other environments? In R I'm used to running
> save(), same with matlab. In python either pickling things or dumping to
> json seems pretty common. (even the scikit-learn docs recommend pickling -
> http://scikit-learn.org/stable/modules/model_persistence.html). These all
> seem basically equivalent java serialization to me..
>
> Would some helper functions (in, say, mllib.util.modelpersistence or
> something) make sense to add?
>
> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
> wrote:
>
>> that works.  is there a better way in spark?  this seems like the most
>> common feature for any machine learning work - to be able to save your
>> model after training it and load it later.
>>
>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
>> wrote:
>>
>>> Plain old java serialization is one straightforward approach if you're
>>> in java/scala.
>>>
>>> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>>>
 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: Store DStreams into Hive using Hive Streaming

2014-11-07 Thread Luiz Geovani Vier
Hi Ted and Silvio, thanks for your responses.

Hive has a new API for streaming (
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest)
that takes care of compaction and doesn't require any downtime for the
table. The data is immediately available and Hive will combine files in
background transparently. I was hoping to use this API from within Spark to
mitigate the issue with lots of small files...

Here's my equivalent code for Trident (work in progress):
https://gist.github.com/lgvier/ee28f1c95ac4f60efc3e
Trident will coordinate the transaction and send all the tuples from each
server/partition to your component at once (Stream.partitionPersist). That
is very helpful since Hive expects batches of records instead of one call
for each record.
I had a look at foreachRDD but it seems to be invoked for each record. I'd
like to get all the Stream's records on each server/partition at once.
For example, if the stream was processed by 3 servers and resulted in 100
records on each server, I'd like to receive 3 calls (one on each server),
each with 100 records. Please let me know if I'm making any sense. I'm
fairly new to Spark.

Thank you,
-Geovani
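For what it's worth, a minimal sketch of how per-partition batches look in Spark Streaming; writeBatch is a hypothetical placeholder, not a real API. foreachRDD fires once per micro-batch and foreachPartition then hands over all records of one partition in a single call, which is roughly the granularity described above.

import org.apache.spark.streaming.dstream.DStream

// stream: DStream[String] is assumed; writeBatch would push one partition's
// records as a single batch (e.g. to the Hive streaming API).
def sinkPerPartition(stream: DStream[String], writeBatch: Seq[String] => Unit): Unit =
  stream.foreachRDD { rdd =>              // once per micro-batch, on the driver
    rdd.foreachPartition { records =>     // once per partition, on an executor
      val batch = records.toSeq           // all records of this partition at once
      if (batch.nonEmpty) writeBatch(batch)
    }
  }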



On Thu, Nov 6, 2014 at 9:54 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

>  Geovani,
>
>  You can use HiveContext to do inserts into a Hive table in a Streaming
> app just as you would a batch app. A DStream is really a collection of RDDs
> so you can run the insert from within the foreachRDD. You just have to be
> careful that you’re not creating large amounts of small files. So you may
> want to either increase the duration of your Streaming batches or
> repartition right before you insert. You’ll just need to do some testing
> based on your ingest volume. You may also want to consider streaming into
> another data store though.
>
>  Thanks,
> Silvio
>
>   From: Luiz Geovani Vier 
> Date: Thursday, November 6, 2014 at 7:46 PM
> To: "user@spark.apache.org" 
> Subject: Store DStreams into Hive using Hive Streaming
>
>   Hello,
>
> Is there a built-in way or connector to store DStream results into an
> existing Hive ORC table using the Hive/HCatalog Streaming API?
> Otherwise, do you have any suggestions regarding the implementation of
> such component?
>
> Thank you,
>  -Geovani
>


MESOS slaves shut down due to "'health check timed out"

2014-11-07 Thread Yangcheng Huang
Hi guys

Do you know how to handle the following case -

= From MESOS log file =
Slave asked to shut down by master@:5050 because 'health
check timed out'
I1107 17:33:20.860988 27573 slave.cpp:1337] Asked to shut down framework 
===

Any configurations to increase this timeout interval?

Thanks
YC

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: LZO support in Spark 1.0.0 - nothing seems to work

2014-11-07 Thread Sree Harsha
@rogthefrog

Were you able to figure out how to fix this issue? 
I also tried all the combinations possible, but no luck yet.

Thanks,
Harsha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/LZO-support-in-Spark-1-0-0-nothing-seems-to-work-tp14494p18349.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: sql - group by on UDF not working

2014-11-07 Thread Shixiong Zhu
It doesn't support such a query right now. I can easily reproduce it. Created a
JIRA here: https://issues.apache.org/jira/browse/SPARK-4296

Best Regards,
Shixiong Zhu
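Until that is fixed, one possible (untested) workaround is to compute the calculated field in a subquery and group on its alias in the outer query. A sketch, assuming a HiveContext named hiveContext:

// Untested workaround sketch: alias the expression in an inner query,
// then GROUP BY the alias column instead of the raw expression.
val byYear = hiveContext.sql("""
  SELECT dobYear, SUM(totalPay)
  FROM (SELECT YEAR(c.Patient.DOB) AS dobYear,
               c.ClaimPay.TotalPayAmnt AS totalPay
        FROM claim c) t
  GROUP BY dobYear
""")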

2014-11-07 16:44 GMT+08:00 Tridib Samanta :

> I am trying to group by on a calculated field. Is it supported on spark
> sql? I am running it on a nested json structure.
>
> Query: SELECT YEAR(c.Patient.DOB), sum(c.ClaimPay.TotalPayAmnt) FROM claim
> c group by YEAR(c.Patient.DOB)
>
> Spark Version: spark-1.2.0-SNAPSHOT wit Hive and hadoop 2.4.
> Error:
>
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression
> not in GROUP BY:
> HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB AS
> DOB#191) AS c_0#185, tree:
> Aggregate
> [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB)],
> [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB AS
> DOB#191) AS c_0#185,SUM(CAST(ClaimPay#5.TotalPayAmnt AS TotalPayAmnt#192,
> LongType)) AS c_1#186L]
>  Subquery c
>   Subquery claim
>LogicalRDD
> [AttendPhysician#0,BillProv#1,Claim#2,ClaimClinic#3,ClaimInfo#4,ClaimPay#5,ClaimTL#6,OpPhysician#7,Patient#8,PayToPhysician#9,Payer#10,Physician#11,RefProv#12,Services#13,Subscriber#14],
> MappedRDD[5] at map at JsonRDD.scala:43
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> at
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
> at
> scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
> at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:423)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
> at $iwC$$iwC$$iwC.<init>(<console>:22)
> at $iwC$$iwC.<init>(<console>:24)
> at $iwC.<init>(<console>:26)
> at <init>(<console>:28)
> at .<init>(<console>:32)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.repl.SparkIMain$Rea

Native / C/C++ code integration

2014-11-07 Thread Paul Wais
Dear List,

Has anybody had experience integrating C/C++ code into Spark jobs?  

I have done some work on this topic using JNA.  I wrote a FlatMapFunction
that processes all partition entries using a C++ library.  This approach
works well, but there are some tradeoffs:
 * Shipping the native dylib with the app jar and loading it at runtime
requires a bit of work (on top of normal JNA usage)
 * Native code doesn't respect the executor heap limits.  Under heavy memory
pressure, the native code can sometimes ENOMEM sporadically.
 * While JNA can map Strings, structs, and Java primitive types, the user
still needs to deal with more complex objects.  E.g. re-serialize
protobuf/thrift objects, or provide some other encoding for moving data
between Java and C/C++.
 * C++ static initialization is not thread-safe before C++11, so the user sometimes needs
to take care when running inside multi-threaded executors
 * Avoiding memory copies can be a little tricky
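For reference, a minimal untested sketch of the mapPartitions-plus-JNA pattern described above; the library name, the process() signature, and how the shared library ends up on jna.library.path are all assumptions made for illustration:

import com.sun.jna.{Library, Native}
import org.apache.spark.rdd.RDD

// Hypothetical native binding: "mylib" and process() are made-up names; the
// dylib/so is assumed to be shipped with the app jar and extracted to a
// directory on jna.library.path before first use.
trait MyLib extends Library {
  def process(input: String): String
}

object NativeHolder {
  // Loaded lazily, once per executor JVM rather than once per task.
  lazy val lib: MyLib =
    Native.loadLibrary("mylib", classOf[MyLib]).asInstanceOf[MyLib]
}

def runNative(input: RDD[String]): RDD[String] =
  input.mapPartitions { records =>
    val lib = NativeHolder.lib
    records.map(lib.process)   // one JNI crossing per record
  }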

One other alternative approach comes to mind is pipe().  However, PipedRDD
requires copying data over pipes, does not support binary data (?), and
native code errors that crash the subprocess don't bubble up to the Spark
job as nicely as with JNA.

Is there a way to expose raw, in-memory partition/block data to native code?

Has anybody else attacked this problem a different way?

All the best,
-Paul 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Native-C-C-code-integration-tp18347.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: multiple spark context in same driver program

2014-11-07 Thread Akhil Das
My bad, I just fired up a spark-shell and created a new sparkContext and it
was working fine. I basically did a parallelize and collect with both
sparkContexts.

Thanks
Best Regards

On Fri, Nov 7, 2014 at 3:17 PM, Tobias Pfeiffer  wrote:

> Hi,
>
> On Fri, Nov 7, 2014 at 4:58 PM, Akhil Das 
> wrote:
>>
>> That doc was created during the initial days (Spark 0.8.0), you can of
>> course create multiple sparkContexts in the same driver program now.
>>
>
> You sure about that? According to
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-spark-context-in-local-mode-thread-safe-td7275.html
> (June 2014), "you currently can’t have multiple SparkContext objects in the
> same JVM".
>
> Tobias
>
>
>


Re: about write mongodb in mapPartitions

2014-11-07 Thread Tobias Pfeiffer
Hi,

On Fri, Nov 7, 2014 at 6:23 PM, qinwei  wrote:
>
> args.map(arg => {
> coll.insert(new BasicDBObject("pkg", arg))
> arg
> })
>
> mongoClient.close()
> args
>

As the results of args.map are never used anywhere, I think the loop body
is not executed at all. Maybe try:

val argsProcessed = args.map(arg => {
coll.insert(new BasicDBObject("pkg", arg))
arg
})

mongoClient.close()
argsProcessed

Tobias
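One more caveat, spelled out as an untested sketch reusing the names from the original post: args is an Iterator, so the map body runs lazily, only when the returned iterator is consumed downstream, by which point the client may already be closed. Materializing the partition before closing avoids that:

import com.mongodb.{BasicDBObject, MongoClient}

val newRDD = rdd.mapPartitions { args =>
  val mongoClient = new MongoClient("host", port)
  val coll = mongoClient.getDB("db").getCollection("collectionA")

  // Force the inserts to run here, while the client is still open.
  val processed = args.map { arg =>
    coll.insert(new BasicDBObject("pkg", arg))
    arg
  }.toList

  mongoClient.close()
  processed.iterator
}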


Re: about write mongodb in mapPartitions

2014-11-07 Thread Akhil Das
Why not saveAsNewAPIHadoopFile?


// Define your mongoDB confs
val config = new Configuration()
config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output")

// Write everything to mongo
rdd.saveAsNewAPIHadoopFile("file:///some/random", classOf[Any],
  classOf[Any], classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]],
  config)


Thanks
Best Regards

On Fri, Nov 7, 2014 at 2:53 PM, qinwei  wrote:

> Hi, everyone
>
> I come across with a prolem about writing data to mongodb in
> mapPartitions, my code is as below:
>
>  val sourceRDD = sc.textFile("hdfs://host:port/sourcePath")
>   // some transformations
> val rdd= sourceRDD .map(mapFunc).filter(filterFunc)
> val newRDD = rdd.mapPartitions(args => {
> val mongoClient = new MongoClient("host", port)
> val db = mongoClient.getDB("db")
> val coll = db.getCollection("collectionA")
>
> args.map(arg => {
> coll.insert(new BasicDBObject("pkg", arg))
> arg
> })
>
> mongoClient.close()
> args
> })
>
> newRDD.saveAsTextFile("hdfs://host:port/path")
>
> The application saved data to HDFS correctly, but not mongodb, is
> there someting wrong?
> I know that collecting the newRDD to driver and then saving it to
> mongodb will success, but will the following saveAsTextFile read the
> filesystem once again?
>
> Thanks
>
>
> --
> qinwei
>


Re: multiple spark context in same driver program

2014-11-07 Thread Tobias Pfeiffer
Hi,

On Fri, Nov 7, 2014 at 4:58 PM, Akhil Das 
wrote:
>
> That doc was created during the initial days (Spark 0.8.0), you can of
> course create multiple sparkContexts in the same driver program now.
>

You sure about that? According to
http://apache-spark-user-list.1001560.n3.nabble.com/Is-spark-context-in-local-mode-thread-safe-td7275.html
(June 2014), "you currently can’t have multiple SparkContext objects in the
same JVM".

Tobias


about write mongodb in mapPartitions

2014-11-07 Thread qinwei






Hi, everyone
    I come across with a problem about writing data to mongodb in mapPartitions,
my code is as below:

    val sourceRDD = sc.textFile("hdfs://host:port/sourcePath")
    // some transformations
    val rdd = sourceRDD.map(mapFunc).filter(filterFunc)
    val newRDD = rdd.mapPartitions(args => {
        val mongoClient = new MongoClient("host", port)
        val db = mongoClient.getDB("db")
        val coll = db.getCollection("collectionA")

        args.map(arg => {
            coll.insert(new BasicDBObject("pkg", arg))
            arg
        })

        mongoClient.close()
        args
    })

    newRDD.saveAsTextFile("hdfs://host:port/path")

    The application saved data to HDFS correctly, but not to mongodb. Is there something wrong?
    I know that collecting the newRDD to the driver and then saving it to mongodb will succeed,
but will the following saveAsTextFile read the filesystem once again?

    Thanks

qinwei



sql - group by on UDF not working

2014-11-07 Thread Tridib Samanta
I am trying to group by on a calculated field. Is it supported on spark sql? I 
am running it on a nested json structure.
 
Query: SELECT YEAR(c.Patient.DOB), sum(c.ClaimPay.TotalPayAmnt) FROM claim c 
group by YEAR(c.Patient.DOB)
 
Spark Version: spark-1.2.0-SNAPSHOT wit Hive and hadoop 2.4.
Error: 
 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not 
in GROUP BY: HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB 
AS DOB#191) AS c_0#185, tree:
Aggregate [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB)], 
[HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB AS DOB#191) 
AS c_0#185,SUM(CAST(ClaimPay#5.TotalPayAmnt AS TotalPayAmnt#192, LongType)) AS 
c_1#186L]
 Subquery c
  Subquery claim
   LogicalRDD 
[AttendPhysician#0,BillProv#1,Claim#2,ClaimClinic#3,ClaimInfo#4,ClaimPay#5,ClaimTL#6,OpPhysician#7,Patient#8,PayToPhysician#9,Payer#10,Physician#11,RefProv#12,Services#13,Subscriber#14],
 MappedRDD[5] at map at JsonRDD.scala:43
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:423)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
at $iwC$$iwC$$iwC.<init>(<console>:22)
at $iwC$$iwC.<init>(<console>:24)
at $iwC.<init>(<console>:26)
at <init>(<console>:28)
at .<init>(<console>:32)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
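
A common workaround in this situation, offered here only as a hedged sketch (it is
not confirmed in the thread, and the aliases dobYear/totalPay are made up), is to
compute the derived column once in an inner query and then group by its alias, so
the aggregation only references plain columns:

// Assumes the "claim" table is registered on a HiveContext named sqlContext,
// as in the original query.
val byYear = sqlContext.sql(
  "SELECT dobYear, SUM(totalPay) " +
  "FROM (SELECT YEAR(c.Patient.DOB) AS dobYear, " +
  "             c.ClaimPay.TotalPayAmnt AS totalPay FROM claim c) t " +
  "GROUP BY dobYear")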

RE: Dynamically InferSchema From Hive and Create parquet file

2014-11-07 Thread Jahagirdar, Madhu
Any idea on this?

From: Jahagirdar, Madhu
Sent: Thursday, November 06, 2014 12:28 PM
To: Michael Armbrust
Cc: u...@spark.incubator.apache.org
Subject: RE: Dynamically InferSchema From Hive and Create parquet file

When I create a Hive table in Parquet format, it does not create any metadata 
until data is inserted. So data needs to be there before I can infer the schema, 
otherwise it throws an error. Any workaround for this?
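
One possible direction, sketched here as an assumption rather than a confirmed fix
(the HiveContext variable and table name are placeholders): if the Hive metastore
already knows the table's columns, the StructType can be pulled with a zero-row
query and reused later, without touching any Parquet files:

import org.apache.spark.sql.hive.HiveContext

// Hedged sketch: read the schema of an existing Hive table by selecting zero
// rows and taking the result's schema.
def hiveTableSchema(hiveContext: HiveContext, tableName: String) =
  hiveContext.sql("SELECT * FROM " + tableName + " LIMIT 0").schema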

From: Michael Armbrust [mich...@databricks.com]
Sent: Thursday, November 06, 2014 12:27 AM
To: Jahagirdar, Madhu
Cc: u...@spark.incubator.apache.org
Subject: Re: Dynamically InferSchema From Hive and Create parquet file

That method is for creating a new directory to hold parquet data when there is 
no hive metastore available, thus you have to specify the schema.

If you've already created the table in the metastore you can just query it 
using the sql method:

javahiveContext.sql("SELECT * FROM parquetTable");

You can also load the data as a SchemaRDD without using the metastore, since 
Parquet is self-describing:

javahiveContext.parquetFile(".../path/to/parquetFiles").registerTempTable("parquetData")

On Wed, Nov 5, 2014 at 2:15 AM, Jahagirdar, Madhu 
<madhu.jahagir...@philips.com> wrote:
Currently the createParquetFile method needs a Bean class as one of the parameters.


javahiveContext.createParquetFile(XBean.class,
    IMPALA_TABLE_LOC, true, new Configuration())
    .registerTempTable(TEMP_TABLE_NAME);


Is it possible to dynamically infer the schema from Hive using the HiveContext 
and the table name, and then pass that schema in?


Regards.

Madhu Jahagirdar









-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: CheckPoint Issue with JsonRDD

2014-11-07 Thread Jahagirdar, Madhu
Michael any idea on this?

From: Jahagirdar, Madhu
Sent: Thursday, November 06, 2014 2:36 PM
To: mich...@databricks.com; user
Subject: CheckPoint Issue with JsonRDD

When we enable checkpointing and use jsonRDD we get the following error. Is this 
a bug?


Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
at 
org.apache.spark.sql.SQLContext.applySchema(SQLContext.scala:132)
at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:194)
at 
SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:69)
at 
SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:63)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

=

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{Logging, SparkConf}
import org.apache.spark.sql.api.java.JavaSchemaRDD
import org.apache.spark.sql.hive.api.java.JavaHiveContext
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}


object SparkStreamingToParquet extends Logging {


  /**
   *
   * @param args
   * @throws Exception
   */
  def main(args: Array[String]) {
    if (args.length < 4) {
      logInfo("Please provide valid parameters: <hdfs-file-location> " +
        "<impala-table-location> <temp-table-name> <checkpoint-dir>")
      logInfo("make sure you give the full folder path with '/' at the end, i.e. " +
        "/user/hdfs/abc/")
      System.exit(1)
    }
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)

    // Recreate the StreamingContext from the checkpoint directory if one exists,
    // otherwise build a fresh one.
    val jssc: StreamingContext = StreamingContext.getOrCreate(CHECKPOINT_DIR, () => {
      createContext(args)
    })

    jssc.start()
    jssc.awaitTermination()
  }

  def createContext(args: Array[String]): StreamingContext = {

    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)

    val sparkConf: SparkConf = new SparkConf()
      .setAppName("Json to Parquet")
      .set("spark.cores.max", "3")

    val jssc: StreamingContext = new StreamingContext(sparkConf, new Duration(3))

    val hivecontext: HiveContext = new HiveContext(jssc.sparkContext)

    // Register a Parquet-backed table to insert the streaming data into.
    hivecontext.createParquetFile[Person](IMPALA_TABLE_LOC, true,
      org.apache.spark.deploy.SparkHadoopUtil.get.conf)
      .registerTempTable(TEMP_TABLE_NAME)

    // Schema applied to the incoming JSON records.
    val schemaString = "name age"
    val schema =
      StructType(
        schemaString.split(" ").map(fieldName =>
          StructField(fieldName, StringType, true)))

    val textFileStream = jssc.textFileStream(HDFS_FILE_LOC)

    textFileStream.foreachRDD(rdd => {
      if (rdd != null && rdd.count() > 0) {
        val schRdd = hivecontext.jsonRDD(rdd, schema)
        logInfo("inserting into table: " + TEMP_TABLE_NAME)
        schRdd.insertInto(TEMP_TABLE_NAME)
      }
    })
    jssc.checkpoint(CHECKPOINT_DIR)
    jssc
  }
}



case class Person(name:String, age:String) extends Serializable
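
A pattern that often helps when a SQL context is used inside a checkpointed
streaming job, given here only as a hedged sketch and not as a confirmed fix for
this exact NullPointerException, is to avoid capturing the HiveContext built in
createContext and instead lazily (re)create one from the RDD's own SparkContext
inside foreachRDD:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Lazily instantiated singleton, so a valid HiveContext exists even after
// the StreamingContext is recovered from a checkpoint.
object HiveContextSingleton {
  @transient private var instance: HiveContext = _
  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

// inside createContext, replacing the captured hivecontext:
textFileStream.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    val hc = HiveContextSingleton.getInstance(rdd.sparkContext)
    hc.jsonRDD(rdd, schema).insertInto(TEMP_TABLE_NAME)
  }
}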

Regards,
Madhu jahagirdar



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Bug in Accumulators...

2014-11-07 Thread Aaron Davidson
This may be due in part to Scala allocating an anonymous inner class in
order to execute the for loop. I would expect if you change it to a while
loop like

var i = 0
while (i < 10) {
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  i += 1
}

then the problem may go away. I am not super familiar with the closure
cleaner, but I believe that we cannot prune beyond 1 layer of references,
so the extra class of nesting may be screwing something up. If this is the
case, then I would also expect replacing the accumulator with any other
reference to the enclosing scope (such as a broadcast variable) would have
the same result.
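
For reference, the for loop in the original report desugars to roughly the
following, which is where the extra anonymous function (and hence the extra
enclosing closure that must be serialized) comes from; sc and accum are the
SparkContext and accumulator from the code quoted below:

// Roughly what `for (i <- 1 to 10) { ... }` compiles to: an anonymous
// function passed to Range#foreach, so the loop body lives in one more
// nested closure than the while-loop version above.
(1 to 10).foreach { i =>
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
}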

On Fri, Nov 7, 2014 at 12:03 AM, Shixiong Zhu  wrote:

> Could you provide all pieces of codes which can reproduce the bug? Here is
> my test code:
>
> import org.apache.spark._
> import org.apache.spark.SparkContext._
>
> object SimpleApp {
>
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("SimpleApp")
> val sc = new SparkContext(conf)
>
> val accum = sc.accumulator(0)
> for (i <- 1 to 10) {
>   sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
> }
> sc.stop()
>   }
> }
>
> It works fine both in client and cluster. Since this is a serialization
> bug, the outer class does matter. Could you provide it? Is there
> a SparkContext field in the outer class?
>
> Best Regards,
> Shixiong Zhu
>
> 2014-10-28 0:28 GMT+08:00 octavian.ganea :
>
> I am also using spark 1.1.0 and I ran it on a cluster of nodes (it works
>> if I
>> run it in local mode! )
>>
>> If I put the accumulator inside the for loop, everything will work fine. I
>> guess the bug is that an accumulator can be applied to JUST one RDD.
>>
>> Still another undocumented 'feature' of Spark that no one from the people
>> who maintain Spark is willing to solve or at least to tell us about ...
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263p17372.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Bug in Accumulators...

2014-11-07 Thread Shixiong Zhu
Could you provide all pieces of codes which can reproduce the bug? Here is
my test code:

import org.apache.spark._
import org.apache.spark.SparkContext._

object SimpleApp {

  def main(args: Array[String]) {
val conf = new SparkConf().setAppName("SimpleApp")
val sc = new SparkContext(conf)

val accum = sc.accumulator(0)
for (i <- 1 to 10) {
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
}
sc.stop()
  }
}

It works fine both in client and cluster. Since this is a serialization
bug, the outer class does matter. Could you provide it? Is there
a SparkContext field in the outer class?

Best Regards,
Shixiong Zhu

2014-10-28 0:28 GMT+08:00 octavian.ganea :

> I am also using spark 1.1.0 and I ran it on a cluster of nodes (it works
> if I
> run it in local mode! )
>
> If I put the accumulator inside the for loop, everything will work fine. I
> guess the bug is that an accumulator can be applied to JUST one RDD.
>
> Still another undocumented 'feature' of Spark that no one from the people
> who maintain Spark is willing to solve or at least to tell us about ...
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263p17372.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>