Re: PySpark: slicing issue with dataframes

2015-05-03 Thread Ali Bajwa
Friendly reminder on this one. Just wanted to get confirmation that this
is not by design before I log a JIRA.

Thanks!
Ali


On Tue, Apr 28, 2015 at 9:53 AM, Ali Bajwa ali.ba...@gmail.com wrote:

 Hi experts,

 When I try to use string slicing as part of a Spark program (PySpark),
 I get this error:

 Code:

 import pandas as pd
 from pyspark.sql import SQLContext
 hc = SQLContext(sc)
 A = pd.DataFrame({'Firstname': ['James', 'Ali', 'Daniel'], 'Lastname':
 ['Jones', 'Bajwa', 'Day']})
 a = hc.createDataFrame(A)
 print A

 b = a.select(a.Firstname[:2])
 print b.toPandas()
 c = a.select(a.Lastname[2:])
 print c.toPandas()

 Output:

    Firstname Lastname
 0      James    Jones
 1        Ali    Bajwa
 2     Daniel      Day
   SUBSTR(Firstname, 0, 2)
 0  Ja
 1  Al
 2  Da

 ---------------------------------------------------------------------------
 Py4JError                                Traceback (most recent call last)
 <ipython-input-17-6ee5d7d069ce> in <module>()
      10 b = a.select(a.Firstname[:2])
      11 print b.toPandas()
 ---> 12 c = a.select(a.Lastname[2:])
      13 print c.toPandas()

 /home/jupyter/spark-1.3.1/python/pyspark/sql/dataframe.pyc in substr(self, startPos, length)
    1089             raise TypeError("Can not mix the type")
    1090         if isinstance(startPos, (int, long)):
 -> 1091             jc = self._jc.substr(startPos, length)
    1092         elif isinstance(startPos, Column):
    1093             jc = self._jc.substr(startPos._jc, length._jc)

 /home/jupyter/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
     536         answer = self.gateway_client.send_command(command)
     537         return_value = get_return_value(answer, self.gateway_client,
 --> 538                 self.target_id, self.name)
     539
     540         for temp_arg in temp_args:

 /home/jupyter/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
     302                 raise Py4JError(
     303                     'An error occurred while calling {0}{1}{2}. Trace:\n{3}\n'.
 --> 304                     format(target_id, '.', name, value))
     305             else:
     306                 raise Py4JError(

 Py4JError: An error occurred while calling o1887.substr. Trace:
 py4j.Py4JException: Method substr([class java.lang.Integer, class
 java.lang.Long]) does not exist
 at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
 at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
 at py4j.Gateway.invoke(Gateway.java:252)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:207)
 at java.lang.Thread.run(Thread.java:745)

 It looks like X[:2] works, but X[2:] fails with the error above.
 Has anyone else hit this issue?

 Clearly I can use substr() to work around this, but if this is a confirmed
 bug we should open a JIRA.

 Thanks,
 Ali
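
A note on the failure mode: the error signature suggests that an open-ended
slice such as X[2:] fills in the missing stop value with sys.maxsize, and
py4j passes an integer that large to the JVM as a java.lang.Long, so the
lookup for substr(Integer, Long) finds no matching method, while X[:2]
sends two small ints and succeeds. Assuming that reading is correct, here
is a minimal sketch of the substr() workaround, reusing the dataframe "a"
from the code above; the length cap of 100 is an arbitrary stand-in for
"rest of the string":

# Workaround sketch: call substr() directly with plain int arguments.
# Column.substr(startPos, length) uses 1-based positions, so dropping
# the first two characters (X[2:]) becomes substr(3, n). The cap of
# 100 is an assumption; use any int >= the longest expected value.
c = a.select(a.Lastname.substr(3, 100))
print c.toPandas()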



Re: Question regarding join with multiple columns with pyspark

2015-04-28 Thread Ali Bajwa
Thanks again, Ayan! To close the loop, I have filed the JIRA below to
track this issue:
https://issues.apache.org/jira/browse/SPARK-7197



On Fri, Apr 24, 2015 at 8:21 PM, ayan guha guha.a...@gmail.com wrote:

 I just tested, and your observation about the DataFrame API is correct: it
 behaves weirdly in the case of a multiple-column join. (Maybe we should
 report a JIRA?)

 Solution: you can go back to our good old composite-key field-concatenation
 method. Not ideal, but a workaround. (Of course you can use real SQL as
 well, as shown below.)

 Set up data:

 from pyspark.sql import Row

 a = [[1993,1,100],[1993,2,200],[1994,1,1000],[1994,3,3000],[2000,1,1]]
 b = [[1993,1,"A"],[1994,1,"AA"],[2000,1,"AAA"]]
 YM1 = sc.parallelize(a).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
     price=int(tup[2]), joiningKey=str(tup[0])+"~"+str(tup[1])))
 YM2 = sc.parallelize(b).map(lambda tup: Row(yr=int(tup[0]), mn=int(tup[1]),
     name=str(tup[2]), joiningKey=str(tup[0])+"~"+str(tup[1])))
 print YM1.collect()
 print YM2.collect()

 YM1DF = ssc.createDataFrame(YM1)
 YM2DF = ssc.createDataFrame(YM2)

 print YM1DF.printSchema()
 print YM2DF.printSchema()

 --- This DOES NOT WORK ---

 YMJN = YM1DF.join(YM2DF, YM1DF.yr==YM2DF.yr and YM1DF.mn==YM2DF.mn, "inner")
 print YMJN.printSchema()
 for l in YMJN.collect():
     print l

 Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1',
 mn=1, name=u'A', yr=1993)
 Row(joiningKey=u'1994~1', mn=1, price=100, yr=1994, joiningKey=u'1994~1',
 mn=1, name=u'AA', yr=1994)
 Row(joiningKey=u'2000~1', mn=1, price=100, yr=2000, joiningKey=u'2000~1',
 mn=1, name=u'AAA', yr=2000)
 Row(joiningKey=u'1993~1', mn=1, price=1000, yr=1993, joiningKey=u'1993~1',
 mn=1, name=u'A', yr=1993)
 Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1',
 mn=1, name=u'AA', yr=1994)
 Row(joiningKey=u'2000~1', mn=1, price=1000, yr=2000, joiningKey=u'2000~1',
 mn=1, name=u'AAA', yr=2000)
 Row(joiningKey=u'1993~1', mn=1, price=1, yr=1993,
 joiningKey=u'1993~1', mn=1, name=u'A', yr=1993)
 Row(joiningKey=u'1994~1', mn=1, price=1, yr=1994,
 joiningKey=u'1994~1', mn=1, name=u'AA', yr=1994)
 Row(joiningKey=u'2000~1', mn=1, price=1, yr=2000,
 joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)

 -----

 SQL Solution - works as expected

 YM1DF.registerTempTable("ymdf1")
 YM2DF.registerTempTable("ymdf2")
 YMJNS = ssc.sql("select * from ymdf1 inner join ymdf2 on ymdf1.yr=ymdf2.yr and ymdf1.mn=ymdf2.mn")
 print YMJNS.printSchema()
 for l in YMJNS.collect():
     print l

 Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1',
 mn=1, name=u'AA', yr=1994)
 Row(joiningKey=u'2000~1', mn=1, price=1, yr=2000,
 joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)
 Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1',
 mn=1, name=u'A', yr=1993)

 -----

 Field concat method, works as well

 YMJNA = YM1DF.join(YM2DF, YM1DF.joiningKey==YM2DF.joiningKey, "inner")
 print YMJNA.printSchema()
 for l in YMJNA.collect():
     print l

 Row(joiningKey=u'1994~1', mn=1, price=1000, yr=1994, joiningKey=u'1994~1',
 mn=1, name=u'AA', yr=1994)
 Row(joiningKey=u'1993~1', mn=1, price=100, yr=1993, joiningKey=u'1993~1',
 mn=1, name=u'A', yr=1993)
 Row(joiningKey=u'2000~1', mn=1, price=1, yr=2000,
 joiningKey=u'2000~1', mn=1, name=u'AAA', yr=2000)

 On Sat, Apr 25, 2015 at 10:18 AM, Ali Bajwa ali.ba...@gmail.com wrote:

 Any ideas on this? Any sample code to join 2 data frames on two columns?

 Thanks
 Ali

 On Apr 23, 2015, at 1:05 PM, Ali Bajwa ali.ba...@gmail.com wrote:

  Hi experts,
 
  Sorry if this is a n00b question or has already been answered...
 
  I'm trying to use the DataFrames API in Python to join 2 dataframes
  with more than 1 column. The example I've seen in the documentation
  only shows a single column, so I tried this:
 
  Example code
 
  import pandas as pd
  from pyspark.sql import SQLContext
  hc = SQLContext(sc)
  A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
  '12', '12'], 'value': [100, 200, 300]})
  a = hc.createDataFrame(A)
  B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
  'value': [101, 102]})
  b = hc.createDataFrame(B)

  print "Pandas"  # try with Pandas
  print A
  print B
  print pd.merge(A, B, on=['year', 'month'], how='inner')

  print "Spark"
  print a.toPandas()
  print b.toPandas()
  print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
 
 
  Output:
 
  Pandas
     month  value  year
  0      5    100  1993
  1     12    200  2005
  2     12    300  1994

     month  value  year
  0     12    101  1993
  1     12    102  1993

  Empty DataFrame

  Columns: [month, value_x, year, value_y]

  Index: []

  Spark
     month  value  year
  0      5    100  1993
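
Why the DataFrame-API join in the reply above misbehaves: Python's "and"
keyword never reaches PySpark's operator overloads. The expression
"x and y" returns one of its operands unchanged, and since Column objects
are truthy here, YM1DF.yr==YM2DF.yr and YM1DF.mn==YM2DF.mn silently
reduces to the mn condition alone. That is consistent with the nine-row
output above, which pairs every mn=1 row on the left with every mn=1 row
on the right regardless of year. PySpark overloads the bitwise & operator
for conjunction instead. A minimal sketch of that fix, reusing YM1DF and
YM2DF from the reply above (written against the Spark 1.3-era API used in
this thread, so treat it as a sketch rather than tested code):

# Combine join conditions with & (Column's __and__), not the "and"
# keyword. The parentheses are required because & binds more tightly
# than ==.
cond = (YM1DF.yr == YM2DF.yr) & (YM1DF.mn == YM2DF.mn)
YMJNB = YM1DF.join(YM2DF, cond, "inner")
for l in YMJNB.collect():
    print l

This should return the same three rows as the SQL and composite-key
versions above.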

Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread Ali Bajwa
Any ideas on this? Any sample code to join 2 data frames on two columns?

Thanks
Ali

On Apr 23, 2015, at 1:05 PM, Ali Bajwa ali.ba...@gmail.com wrote:

 Hi experts,

 Sorry if this is a n00b question or has already been answered...

 I'm trying to use the DataFrames API in Python to join 2 dataframes
 with more than 1 column. The example I've seen in the documentation
 only shows a single column, so I tried this:

 Example code

 import pandas as pd
 from pyspark.sql import SQLContext
 hc = SQLContext(sc)
 A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5',
 '12', '12'], 'value': [100, 200, 300]})
 a = hc.createDataFrame(A)
 B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'],
 'value': [101, 102]})
 b = hc.createDataFrame(B)

 print "Pandas"  # try with Pandas
 print A
 print B
 print pd.merge(A, B, on=['year', 'month'], how='inner')

 print "Spark"
 print a.toPandas()
 print b.toPandas()
 print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()


 Output:

 Pandas
    month  value  year
 0      5    100  1993
 1     12    200  2005
 2     12    300  1994

    month  value  year
 0     12    101  1993
 1     12    102  1993

 Empty DataFrame

 Columns: [month, value_x, year, value_y]

 Index: []

 Spark
    month  value  year
 0      5    100  1993
 1     12    200  2005
 2     12    300  1994

    month  value  year
 0     12    101  1993
 1     12    102  1993

    month  value  year  month  value  year
 0     12    200  2005     12    102  1993
 1     12    200  2005     12    101  1993
 2     12    300  1994     12    102  1993
 3     12    300  1994     12    101  1993

 It looks like Spark returns some results where an inner join should
 return nothing.

 Am I doing the join with two columns in the wrong way? If yes, what is
 the right syntax for this?

 Thanks!
 Ali

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
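
The mechanism behind those wrong join results is worth spelling out, since
the failing code looks so plausible. Python evaluates "x and y" by
returning one of the operands unchanged; it never builds a combined
expression. A tiny pure-Python sketch of the effect (FakeColumn is a
hypothetical stand-in, not a PySpark class):

# "and" returns its right operand whenever the left one is truthy, so
# the first join condition is silently discarded before Spark sees it.
class FakeColumn(object):  # hypothetical stand-in for pyspark's Column
    def __init__(self, label):
        self.label = label

yr = FakeColumn("a.year == b.year")
mn = FakeColumn("a.month == b.month")
print (yr and mn).label  # prints: a.month == b.month

Only the month condition reaches the join, which matches the output above:
every month-12 row on one side pairs with every month-12 row on the other,
across years. The fix is to combine conditions with PySpark's & operator,
as sketched earlier in this digest.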


